Workflow 4 - Creating Label Sets for the CNN Classification Task

- London Borough and London Wide Scales

Aim

  • To Create Image Buckets for each Classification Task Label Run variation
  • To Create Machine Learner Friendly Image Classification Labels
  • To Create interesting, descriptive and meaningful Building Typology Sets and allocate Set Membership to the Building Data.

Method


Setup

In [2]:
## Load Libraries and Jupyter User Settings
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:760px !important; }</style>"))
#Run Multiple Commands in One Cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
#Get All DataFrame Columns
import pandas as pd
pd.set_option('display.max_colwidth', -1) 
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import matplotlib.pyplot as plt
#import matplotlib.patches as mpatches

from IPython.display import Markdown, display, HTML
from IPython.display import Image as Img

# Widgets
from __future__ import print_function
from ipywidgets import interact, interactive, fixed, interact_manual, Layout, Button
import ipywidgets as widgets

#%matplotlib inline
In [2]:
#See Appendix for Custom Helper Scripts
#%run plot_helper.py
In [3]:
#Code Toggle Add In
#%run code_toggle.py
In [4]:
## Thumb Gallery Helper
def css_styling():
    styles = open('custom_thumbs.css', 'r').read()
    return HTML(styles)
css_styling()    
    
## Table Label Helper
def printmd(string):
    display(Markdown(string))
Out[4]:

Label Maker Tools

1: Load and Explore Data Model for Candidate Use Case and Label Set Scenarios

 For Identifying Candidate Borough Use Cases and Label Set Combinations


2: Create Labels using Data Filtering and Fuzzy Controls

- Create CNN Labels for London Building Types
- Appendix Cells for post-script Iterative Varied Sample Size Runs, Secondary Munging and Pre-Pre-Processing - for Troubleshooting Model Overfit


3: Create Image Buckets/Folders.

Create Folders Using the shutil and os Libraries


4: Create Auxiliary Descriptor Label Sets

 Accuracy and Speed Results on all Label Sets Trained, for all Parameter, Hyperparameter and Architecture configurations.

1.1 - Get and Check the Master Data as CSV File:

In [7]:
#Load CSV
#Includes stripping for NULLs
df_ldd = pd.read_csv('/Users/anthonysutton/ml2/_DATA_LABELS/LDD_STORE/LDD_MODEL/AB_LDD.csv',  sep=',', 
                           error_bad_lines=False, index_col=False,  na_values=['.'], encoding="ISO-8859-1")
# Occasional second clean-up required with these large Python-created CSVs:
#df_ldd.dropna()

# Visual Check of DataFrame:
df_ldd.shape
df_ldd.head(1)

# Arrange Cols by Alpha, Easier for Exploring
df_ldd = df_ldd.reindex(sorted(df_ldd.columns), axis=1)
df_ldd.head(1)

1.1.b - Filter on Building Completions Only:

In [8]:
df_ldd.Current_pe.unique()
Out[8]:
array(['Completed', 'Started', 'Not started'], dtype=object)
In [9]:
# Not Completed = No SV Image and therefore problem for the Classification Task
# A Reduction of 13k = Significant hit to our total train test image set!
df_ldd[df_ldd['Current_pe'] =='Completed'].shape
Out[9]:
(49863, 93)
In [10]:
df_ldd = df_ldd[df_ldd['Current_pe'] =='Completed']
df_ldd.head(2)
In [11]:
# Also remove demolitions
df_ldd = df_ldd[~df_ldd['Developmen'].str.contains('demolition')]
df_ldd.shape
Out[11]:
(47557, 93)
In [8]:
# Just get the columns of interest:
df_ldd =  df_ldd[['TempID', 'Planning_A', 'Resi_Site_Prop_ha','Non_Resi_Site_Prop_ha','Total_Open_ Space_Exist_ha',
               'Total_Open_ Space_Prop_ha', 'Total_Site_Area_Prop_ha', 'Existing_Total_Residential_Units',
               'Proposed_Total_Residential_Units','Proposed_TotalAffordable_Units','Proposed_Total_Affordable_Percentage',
               'Proposed_Residential_Parking_Spaces','Cash_in_Lieu_Affordable_Housing','Existing_Total_Bedrooms',
                   'Proposed_Total_Bedrooms', 'Existing_Total_Floorspace', 'Proposed_Total_Floorspace',            
               'ClassificationCode','Latitude','Longitude','Developmen', 'Primary_St', 'Site_Name_', 'Postcode_Join']]

#df_ldd.head(1)

1.1.c - Remove Missing SV Images from Model:

In [13]:
#Load CSV
#Includes stripping for NULLs
df_snow = pd.read_csv('ldd_snow.csv',  sep=',', 
                           error_bad_lines=False, index_col=False,  na_values=['.'], encoding="ISO-8859-1")
In [14]:
#London Complete Clean
import os
import csv
#import pandas as pd
count = 0
locations = []
with open('ldd_snow.csv') as csvfile:
    
    reader = csv.reader(csvfile)
    
    for row in reader:
        count = count + 1
        image = str(row[0])
        #print(count)
        image_str = os.path.basename(image)
        y = image_str.replace('col_id_','')
        z = y.replace('.jpg','')
        #print(os.path.basename(z))
        locations.append(z) 
print('done')
done
In [15]:
print(df_ldd.shape)
df_ldd = df_ldd[~df_ldd['TempID'].isin(locations)]
print(df_ldd.shape)
(47557, 93)
(45197, 93)

1.2 - Get, Join and Check AddressBase Classification Schema

In [16]:
#AddBase Schema
#Load CSV
#Includes stripping for NULLs
df_ab_schema = pd.read_csv('LookUps/AB_Schema.csv',  sep=',', 
                           error_bad_lines=False, index_col=False,  na_values=['.'], encoding="ISO-8859-1")
In [17]:
#df_ab_schema.shape
#Some Extra Nulls Maybe WhiteSpace Needs Cleaning
#df_ab_schema.dropna()
df_ab_schema.shape
Out[17]:
(563, 11)
In [18]:
#Join Schema to Master Data
df_ldd_ab_sch_join = pd.merge(df_ldd, df_ab_schema, on='ClassificationCode' , how='left')
df_ldd_ab_sch_join.shape
#df_ldd_ab_sch_join.head(1)
#df_ldd_ab_sch_join
#df_ldd_ab_sch_join.to_csv('df_ldd_ab_sch_join.csv', index=False)
Out[18]:
(45197, 103)

1.3 - Get, Join and Check Auxiliary Data:

Note:

For convenience we retain the Acorn CACI, UK Census, Place Pulse and other auxiliary Property Descriptors in a separate file to avoid loading up too much data. Our initial exploration of the data in search of suitable Labels for the CNN Classification Task will focus on the LDD and OS AddressBase and their Property/Address Classification schemas.

In [19]:
#ACORN - Only Load When Creating Auxiliary Label Sets
#Load CSV
#Includes stripping for NULLs
df_acorn = pd.read_csv('/Users/anthonysutton/ml2/_ACORN_EXPLORER/ACORN_LONDON_ONLY.csv',  sep=',', 
                           error_bad_lines=False, index_col=False,  na_values=['.'], encoding="ISO-8859-1")
In [20]:
#Join the 2 x tables
df_ldd_acorn_join = pd.merge(df_ldd_ab_sch_join, df_acorn, on='Postcode_Join' , how='left')
#df_ldd_acorn_join = pd.merge(df_ldd, df_acorn, on='Postcode_Join' , how='left')


df_ldd_acorn_join.shape
#df_ldd_acorn_join.head(1)
#df_ldd_Filter1 = df_ldd_acorn_join
#df_ldd_Filter1.head(1)
#df_ldd_Filter1 = []
Out[20]:
(45197, 111)

1.4 - Get, Join and Check Non LDD Dataset for "Old and New London" Classification Label Attempt:

In [ ]:
# Warning: This file is Large (>250k) -> Only Use for Live Runs
df_non_ldd = pd.read_csv('/Users/anthonysutton/ml2/_DATA_LABELS/LDD_STORE/LDD_MODEL/nonLDD.csv',  sep=',', 
                           error_bad_lines=False, index_col=False,  
                         usecols = ['Field1', 'Postcode', 'SubBuildingName', 'BuildingName',   'ClassificationCode'], 
                         na_values=['.'], encoding="ISO-8859-1")

df_non_ldd.shape
df_non_ldd.head(1)

In [17]:
#print(df_non_ldd.dtypes)
#print(df_ldd.dtypes)
#print(df_ldd_acorn_join.dtypes)
#print(df_ldd_Filter1.dtypes)

Case Study Selection

For early run-throughs of the Project Workflow we limited Label Set Creation and Classification runs to a small number of London Boroughs and focused on areas with which we were familiar. This "Ground Truth" approach would assist greatly in validating address data and in troubleshooting and assessing labelling approaches to achieving optimal CNN training accuracy.

When choosing our candidate Boroughs, we are looking for a blend of good amounts of data (100s and 1000s of records) coupled with data that evenly represents the full cross-section of London Building Types.

1.5 - View and Identify Boroughs of Interest:

In [18]:
printmd('<br><br><b>`Number & Type of Properties Built Since 2004 By Borough`</b> &#9663;<br><br>')
df_selector = df_ldd_acorn_join.groupby(['Planning_A', 'Primary_Desc'])['Planning_A'].count().unstack('Primary_Desc').fillna(0)
#df_selector = df_ldd_acorn_join.groupby(['Primary_Desc']).count().unstack('Primary_Desc').fillna(0)
df_selector.plot(kind='bar', stacked=True, figsize=(18, 10));
#fig = df_selector.plot(kind='bar', stacked=True, figsize=(28, 12)).get_figure()
#fig.savefig('Workflow_images/Number_Type_Prop1.png')



`Number & Type of Properties Built Since 2004 By Borough`

In [19]:
#df_selector
In [20]:
data = df_ldd_acorn_join.groupby('ClassificationCode').filter(lambda x : len(x)>100)
df_selector = data.groupby(['Tertiary_Desc']).count()
#df_selector_labs = df_ldd_acorn_join.groupby(['Primary_Desc']).count().unstack('Primary_Desc').fillna(0)
df_selector.shape
#df_selector_labs
df_selector = df_selector['Planning_A']
#df_selector
#df_selector = df_ldd_acorn_join.groupby(['Planning_A', 'Primary_Desc'])['Planning_A'].count().unstack('Primary_Desc').fillna(0)



#df_selector_labs = df_selector.index.values
#df_selector_labs

fig = df_selector.plot(kind='bar', stacked=True, figsize=(15,8)).get_figure()
fig.savefig('Workflow_images/Number_Type_Prop.png')
Out[20]:
(16, 41)

Note:


Here it is apparent that boroughs such as Westminster, Croydon and Barnet have large amounts of New Development Property Data. However, we also need to know how this translates into the spread of property types and the quality of the address and image data, so we will explore these aspects of the data model before we proceed with creating our sets of CNN Training Labels.

We will also ignore the Misc, Unclassified and Parent Shell (address admin) categories, retaining the Commercial and Residential classes.

First off, let's filter the data model according to our selected boroughs of interest:

1.6 - Select Boroughs of Interest

In [22]:
Borough_Selection = df_ldd_acorn_join.Planning_A.unique()
In [ ]:
boro_name = df_ldd_acorn_join.Planning_A.unique()
Borough_Selector = widgets.SelectMultiple(
    options= boro_name,
    #value='2',
    description='',
    disabled=False,
    rows=10,
    layout=Layout(width='50%', height='100%')
    )   
    
#widgets.HBox([widgets.Label(value="<b>Select (CTRL+ for Multiple) Boroughs:</b>"), Borough_Selector])
printmd("<div class='alert alert-block alert-info'><b>Select Borough:</b> (ctrl+ for Multiple Select) &#9662;")
display(Borough_Selector)
#printmd("</div>")
#Borough_Selection = Borough_Selector.value

1.7 - Apply User Selection to Master Property Dataframe:

In [25]:
#Check Dataframe
#df_ldd_Filter1.shape
#df_ldd_Filter1.head(1)
Borough_Selection = Borough_Selector.value
printmd("<div class='alert alert-block alert-info'>You have selected:<b> " + str(Borough_Selection) + "</b></div>")
You have selected: ('Croydon',)

Borough Selection:

You have selected: ('Bexley', 'Camden', 'City of London', 'Enfield', 'Greenwich', 'Hackney', 'Islington', 'Kensington and Chelsea', 'Lewisham', 'Redbridge', 'Newham', 'Merton', 'Southwark', 'Richmond upon Thames', 'Waltham Forest', 'Tower Hamlets')

London Wide Selection:

You have selected: ('Bexley', 'Barking and Dagenham', 'Barnet', 'Brent', 'Bromley', 'Camden', 'Croydon', 'City of London', 'Enfield', 'Ealing', 'Greenwich', 'Hackney', 'Haringey', 'Hammersmith and Fulham', 'Harrow', 'Havering', 'Hounslow', 'Hillingdon', 'Kensington and Chelsea', 'Islington', 'Kingston upon Thames', 'Lewisham', 'Lambeth', 'Redbridge', 'Newham', 'Merton', 'London Legacy DC', 'Southwark', 'Sutton', 'Richmond upon Thames', 'Waltham Forest', 'Wandsworth', 'Tower Hamlets', 'Westminster')

In [24]:
#Run 55
#Borough_Selection = ('Bexley', 'Camden', 'City of London', 'Enfield', 'Greenwich', 'Hackney', 'Islington', 'Kensington and Chelsea', 'Lewisham', 'Redbridge', 'Newham', 'Merton', 'Southwark', 'Richmond upon Thames', 'Waltham Forest', 'Tower Hamlets')
In [25]:
Borough_Selection =  ('Bexley', 'Barking and Dagenham', 'Barnet', 'Brent', 'Bromley', 'Camden', 'Croydon', 'City of London', 'Enfield', 'Ealing', 'Greenwich', 'Hackney', 'Haringey', 'Hammersmith and Fulham', 'Harrow', 'Havering', 'Hounslow', 'Hillingdon', 'Kensington and Chelsea', 'Islington', 'Kingston upon Thames', 'Lewisham', 'Lambeth', 'Redbridge', 'Newham', 'Merton', 'London Legacy DC', 'Southwark', 'Sutton', 'Richmond upon Thames', 'Waltham Forest', 'Wandsworth', 'Tower Hamlets', 'Westminster')

Apply Borough Selection Filter to Data Model

In [26]:
#Apply Filter DF by Borough Selection
df_ldd_Filter1 = df_ldd_acorn_join.loc[df_ldd_acorn_join['Planning_A'].isin(Borough_Selection)]
df_ldd_Filter1.shape
df_ldd_Filter1.columns
Out[26]:
(45197, 42)
Out[26]:
Index(['TempID', 'Planning_A', 'Resi_Site_Prop_ha', 'Non_Resi_Site_Prop_ha',
       'Total_Open_ Space_Exist_ha', 'Total_Open_ Space_Prop_ha',
       'Total_Site_Area_Prop_ha', 'Existing_Total_Residential_Units',
       'Proposed_Total_Residential_Units', 'Proposed_TotalAffordable_Units',
       'Proposed_Total_Affordable_Percentage',
       'Proposed_Residential_Parking_Spaces',
       'Cash_in_Lieu_Affordable_Housing', 'Existing_Total_Bedrooms',
       'Proposed_Total_Bedrooms', 'Existing_Total_Floorspace',
       'Proposed_Total_Floorspace', 'ClassificationCode', 'Latitude',
       'Longitude', 'Developmen', 'Primary_St', 'Site_Name_', 'Postcode_Join',
       'Concatenated', 'Class_Desc', 'Primary_Code', 'Secondary_Code',
       'Tertiary_Code', 'Quaternary_Code', 'Primary_Desc', 'Secondary_Desc',
       'Tertiary_Desc', 'Quaternary_Desc', 'Unnamed: 0', 'Postcode',
       'Large User', 'Deleted', 'Acorn Category', 'Acorn Group', 'Acorn Type',
       'Description'],
      dtype='object')

Note

  • From an original unprocessed dataset of 66284 Building Developments we have a London Wide set of 45197 records.

1.8 - Inspect Use Class Category Sample Size and Distribution of our Data Selection:

In [27]:
# Just get list of labels for inspecting add base building schema types?
#labels = df_ldd_Filter1['ClassificationCode'].value_counts().reset_index(name="count").query("count > 50")
#labels = df_ldd_Filter1['class_label'].unique()

#printmd('<br><br><b>`Number and Type of Properties Built Since 2004`</b> &#9663;<br><br>')
#plotdat_label(df_ldd_Filter1,'Tertiary_Desc')
#fig = df_selector.plot(kind='bar', figsize=(18, 10)).get_figure()
#fig.savefig('Workflow_images/Number_Type_Prop.png')
#printmd("<b>`Selected Boroughs:" + str(Borough_Selection) + '`</b><br><br><br><br>')


data = df_ldd_Filter1.groupby('ClassificationCode').filter(lambda x : len(x)>100)
l=data.groupby('Tertiary_Desc').size()
l.sort_values()
fig=plt.figure(figsize=(20,5))
l.plot(kind='bar',fontsize=12,color='k')  
plt.xlabel('Property Type',)
plt.ylabel('Number of ...',fontsize=10)
plt.show()
fig.savefig('Workflow_images/Number_Type_Prop_Desc.png')

    
    
Out[27]:
Tertiary_Desc
Electricity Sub-Station                                  109  
Hotel/Motel                                              113  
Additional Mail / Packet Addressee                       148  
Public House / Bar / Nightclub                           166  
Care / Nursing Home                                      186  
Workshop / Light Industrial                              194  
Warehouse / Store / Storage Depot                        208  
HMO Bedsit / Other Non Self Contained Accommodation      235  
Restaurant / Cafeteria                                   311  
Development Site                                         417  
Office / Work Studio                                     1279 
Shop / Showroom                                          1756 
Semi-Detached                                            1905 
Detached                                                 2622 
Terraced                                                 3563 
Self Contained Flat (Includes Maisonette / Apartment)    16814
dtype: int64
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b329c88>
Out[27]:
Text(0.5,0,'Property Type')
Out[27]:
Text(0,0.5,'Number of ...')

Use Class Distribution

  • As we might expect, we see the largest amounts of data in the Residential categories, with RD06 (Flats) eclipsing the other Resi House Type Classes (RD02 to RD04), and the smallest numbers to be found in the Commercial Industrial Class (CI01).
  • Nevertheless we have sufficient data to attempt a CNN Classification Task with the Inception architecture (n>100), and when the sparser populated sets are combined we appear to have a reasonable data representation of the cross-section of London building types.
  • We will also need to ensure we have equal amounts of data from each selected category when selecting our Training Data, so that the better-represented categories do not skew our image data/building type data sample.
    • One point of interest is that there are large numbers of addresses in the generic Commercial and Residential classes, and also several well-populated misc categories that represent addresses that do not deal with physical Building Types (L, PP, R, U). We will omit these from our study, although it would be interesting to see what StreetView images are associated with these address types.
    • Note the Local Authority Street Naming and Numbering statutory function and the administrative and systemic challenges faced in creating and maintaining accurate address data on our changing, evolving cities and towns.
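The balancing step described above can be sketched with pandas. This is a hypothetical helper, not the workflow's actual code: it down-samples every class to the same size so no label dominates (column names and the toy frame are illustrative).

```python
import pandas as pd

def balance_classes(df, label_col, n_per_class, seed=42):
    """Down-sample every class to n_per_class rows so no label dominates."""
    # Drop classes that don't have enough examples to sample from
    eligible = df.groupby(label_col).filter(lambda g: len(g) >= n_per_class)
    # Draw a fixed-size random sample from each remaining class
    return (eligible.groupby(label_col, group_keys=False)
                    .apply(lambda g: g.sample(n=n_per_class, random_state=seed)))

# Toy demonstration with a deliberately skewed label distribution
toy = pd.DataFrame({'class_label': ['RD06'] * 50 + ['RD02'] * 10 + ['CI01'] * 3,
                    'TempID': range(63)})
balanced = balance_classes(toy, 'class_label', 10)
print(balanced['class_label'].value_counts())
```

Classes with fewer than `n_per_class` rows (here `CI01`) are dropped entirely rather than over-sampled, which matches the n>100 eligibility filtering used elsewhere in this workflow.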

Note:

- Mostly Residential - see the London Plan etc.
- Need to Standardize/Normalize Class Type Samples.
- Commercial and Residential Generic Groupings = 1000s
- More granular label differentiation = 100s
- Therefore we will try both approaches

1.9 - Map Address Base Secondary Class Descriptions to Generic(Primary Class) Type

In [28]:
df_ldd_Filter1.loc[:,'class_label'] = df_ldd_Filter1['ClassificationCode'].astype(str).str[0]
In [29]:
printmd('<br><br><b>`Primary Class(AddressBase Schema):  C = Commercial, R = Residential`</b> &#9663;<br><br>')
df_ldd_Filter2 = df_ldd_Filter1[['ClassificationCode','class_label', 'Class_Desc']]
df_ldd_Filter2 = df_ldd_Filter2[(df_ldd_Filter2['class_label'].str.contains('C')) |
                                (df_ldd_Filter2['class_label'].str.contains('R'))]
df_ldd_Filter2 =  df_ldd_Filter2.groupby('ClassificationCode').filter(lambda x : len(x)>100)
df2 = df_ldd_Filter2.groupby(['class_label', 'Class_Desc'])['class_label'].count().unstack('Class_Desc').fillna(0)
df2.plot(kind='bar', figsize=(18, 10));



`Primary Class(AddressBase Schema): C = Commercial, R = Residential`

Note:

- Group Resi and Commercial
- Where sample sizes are slim, can we combine class types?
    e.g. Detached, Semi-Detached, Terraced = all members of the 'House' Class
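One simple way to implement the grouping suggested above is a lookup dictionary mapped onto the tertiary description. The mapping below is a sketch: the super-class names and which types get merged are illustrative choices, not the study's final label sets.

```python
import pandas as pd

# Hypothetical mapping from fine-grained AddressBase tertiary descriptions
# to broader, more ML-friendly super-classes
HOUSE_MAP = {
    'Detached': 'House',
    'Semi-Detached': 'House',
    'Terraced': 'House',
    'Self Contained Flat (Includes Maisonette / Apartment)': 'Flat',
}

df = pd.DataFrame({'Tertiary_Desc': [
    'Detached',
    'Terraced',
    'Self Contained Flat (Includes Maisonette / Apartment)',
    'Office / Work Studio',
]})
# Unmapped types fall back to their original description unchanged
df['super_class'] = df['Tertiary_Desc'].map(HOUSE_MAP).fillna(df['Tertiary_Desc'])
print(df['super_class'].tolist())  # → ['House', 'House', 'Flat', 'Office / Work Studio']
```

Using `.map(...).fillna(...)` keeps any class we have not explicitly grouped, so slimmer categories can be merged incrementally as sample-size problems appear.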
In [30]:
df_ldd_Filter1.shape
#df_ldd_Filter1.head(1)
Out[30]:
(45197, 43)
In [31]:
df_ldd_Filter1.loc[:,'class_label'] = df_ldd_Filter1['ClassificationCode'].astype(str).str.slice(stop=2)
In [32]:
df_ldd_Filter1.shape
#df_ldd_Filter1.head(4)
Out[32]:
(45197, 43)

Note:

- As we might expect, numbers of Residential Building Types eclipse those from the Commercial category.
- Some classes feature only around one hundred examples.
- Within each class there are different degrees of class membership. We will use statutory planning typology (major/minor) and also some fuzzy controls to create classes with the greatest typological concentration.
- Our objective is to achieve the greatest accuracy rate from our ML Image Classification Task.
- We therefore create two kinds of label set: one that groups secondary and tertiary types into a more generalized type, concerned with large numbers of images; the other concerned with representing building class types in more granular detail.

1.10 - Create/Use AddressBase Use Class Lookup Tool.

  • Use this tool to explore the Address Base BS7666 Class Schema
In [27]:
df_ab_schema["LookUp"] =  df_ab_schema['Concatenated'] +  ':  ' + df_ab_schema['Class_Desc'] \
    +  '----> ' + df_ab_schema['Primary_Desc'] 
In [ ]:
##Code LookUp
df_ab_sch_slice_np = df_ab_schema.LookUp.unique()

addbase_scheme_Selector =    widgets.Select(
    options= df_ab_sch_slice_np,
    #value='2',
    description='',
    disabled=False,
    rows=10,
    layout=Layout(width='50%', height='100%')
    )   
#widgets.HBox([widgets.Label(value="<b>Select (CTRL+ for Multiple) Boroughs:</b>"), Borough_Selector])
printmd("<div class='alert alert-block alert-success'><b>Look Up Table for AB Class Schema Descriptions:</b> " +
        "&#9662;")
display(addbase_scheme_Selector)
printmd("</div>")

1.11 - Visual Inspection of Image Data for All Address Base Use Class Types

Random Image Inspection of 2 x Images from each Property Type group

Now that, caveats notwithstanding, we are sure our data model is a fair representation of the Address Types that we are interested in, and that we have enough Candidate Records, let's load and view some images of buildings from each of these groups to check the quality of the image data. The Thumb Gallery below lists an example from each of the Building Use Class Types represented in the LDD data population.

In [35]:
## Table Label Helper
#Don't Forget to Remove the Sample input
def thumbs(d_frame, title, size, sample):
    printmd('<br><br><b>`' + title +'`</b> &#9663;<br><br>')
    the_frame = eval(d_frame)
    #print(the_frame.shape)
    #imagey = df_ldd_Filter1['TempID'].sample(n=1)
    size = sample   # sample size per class
    replace = True  # with replacement
    # Ensure we have n>100 Labels 
    thumbsdf = the_frame.groupby('ClassificationCode').filter(lambda x : len(x)>size)
    fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
    thumb_gally = thumbsdf.groupby('ClassificationCode', as_index=False).apply(fn)
    thumb_gally2 = thumb_gally[['TempID', 'ClassificationCode', 'class_label', 'Primary_Desc', 'Class_Desc']].values
    gallery = ""

    for count, element in enumerate(thumb_gally2, 1):   
        if thumb_gally2[count-1,0] <= 29192:
            inp2 = '../../_STREETVIEW_EXPLORER/LDD_Complete/col_id_' + str(thumb_gally2[count-1,0]) +'.jpg'
        else:
            inp2 = '../../_STREETVIEW_EXPLORER/LDD_Complete/batch2/col_id_' + str(thumb_gally2[count-1,0]) +'.jpg'
        inp3 = "<div class='gallery'><div class='zoom'><img src='" +  inp2 + "' width='' height=''></div><div class='desc'>" + thumb_gally2[count-1,1] + \
                 ": " + str(thumb_gally2[count-1,4]) + " Image id: " + str(thumb_gally2[count-1,0]) +  "</div></div>"
        gallery = gallery + inp3

    display(HTML(gallery))  
In [ ]:
thumbs("df_ldd_Filter1", "Address Base, Address Types, All", 200, 2)
  • Thumb Galleries on this HTML Presentation Version are Samples Only

Notes
  • Image Quality
  • Missing Images (Unprocessed Batch Run)
  • For more info on Addressbase Codes please see next section



Machine Learning Note:

On Test and Training Sets

One of the things the (Tensorflow Hub) script does under the hood when you point it at a folder of images is divide them up into three different sets. The usual split is to put 80% of the images into the main training set, keep 10% aside to run as validation frequently during training, and then have a final 10% that are used less often as a testing set to predict the real-world performance of the classifier. These ratios can be controlled using the --testing_percentage and --validation_percentage flags. In general you should be able to leave these values at their defaults, since you won't usually find any advantage in adjusting them.

Note that the script uses the image filenames (rather than a completely random function) to divide the images among the training, validation, and test sets. This is done to ensure that images don't get moved between training and testing sets on different runs, since that could be a problem if images that had been used for training a model were subsequently used in a validation set.
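The filename-based split described above can be reproduced in a few lines. This is a simplified sketch of the idea (the real retrain.py uses a similar SHA-1 hash of the filename, with some extra handling), not the script's actual code:

```python
import hashlib

def assign_set(filename, validation_pct=10, testing_pct=10):
    """Deterministically route an image to train/validation/test by filename hash."""
    # Hash the name so the assignment never changes between runs
    h = int(hashlib.sha1(filename.encode('utf-8')).hexdigest(), 16)
    bucket = h % 100  # map the hash onto 0-99
    if bucket < validation_pct:
        return 'validation'
    if bucket < validation_pct + testing_pct:
        return 'testing'
    return 'training'

# The same file always lands in the same set
assert assign_set('col_id_12345.jpg') == assign_set('col_id_12345.jpg')

# Over many files, the split approximates the requested 80/10/10 ratios
sets = [assign_set(f'col_id_{i}.jpg') for i in range(1000)]
print({s: sets.count(s) for s in ('training', 'validation', 'testing')})
```

Because the assignment depends only on the filename, adding new images later never moves an existing image out of its original set.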

You might notice that the validation accuracy fluctuates among iterations. Much of this fluctuation arises from the fact that a random subset of the validation set is chosen for each validation accuracy measurement. The fluctuations can be greatly reduced, at the cost of some increase in training time, by choosing --validation_batch_size=-1, which uses the entire validation set for each accuracy computation.

Once training is complete, you may find it insightful to examine misclassified images in the test set. This can be done by adding the flag --print_misclassified_test_images. This may help you get a feeling for which types of images were most confusing for the model, and which categories were most difficult to distinguish. For instance, you might discover that some subtype of a particular category, or some unusual photo angle, is particularly difficult to identify, which may encourage you to add more training images of that subtype. Oftentimes, examining misclassified images can also point to errors in the input data set, such as mislabeled, low-quality, or ambiguous images. However, one should generally avoid point-fixing individual errors in the test set, since they are likely to merely reflect more general problems in the (much larger) training set.

https://www.tensorflow.org/hub/tutorials/image_retraining

Creating a Set of Training Images

The first place to start is by looking at the images you've gathered, since the most common issues we see with training come from the data that's being fed in.
For training to work well, you should gather at least a hundred photos of each kind of object you want to recognize. The more you can gather, the better the accuracy of your trained model is likely to be. You also need to make sure that the photos are a good representation of what your application will actually encounter. For example, if you take all your photos indoors against a blank wall and your users are trying to recognize objects outdoors, you probably won't see good results when you deploy.
Another pitfall to avoid is that the learning process will pick up on anything that the labeled images have in common with each other, and if you're not careful that might be something that's not useful. For example if you photograph one kind of object in a blue room, and another in a green one, then the model will end up basing its prediction on the background color, not the features of the object you actually care about. To avoid this, try to take pictures in as wide a variety of situations as you can, at different times, and with different devices.
You may also want to think about the categories you use. It might be worth splitting big categories that cover a lot of different physical forms into smaller ones that are more visually distinct. For example instead of 'vehicle' you might use 'car', 'motorbike', and 'truck'. It's also worth thinking about whether you have a 'closed world' or an 'open world' problem. In a closed world, the only things you'll ever be asked to categorize are the classes of object you know about. This might apply to a plant recognition app where you know the user is likely to be taking a picture of a flower, so all you have to do is decide which species. By contrast a roaming robot might see all sorts of different things through its camera as it wanders around the world. In that case you'd want the classifier to report if it wasn't sure what it was seeing. This can be hard to do well, but often if you collect a large number of typical 'background' photos with no relevant objects in them, you can add them to an extra 'unknown' class in your image folders.
It's also worth checking to make sure that all of your images are labeled correctly. Often user-generated tags are unreliable for our purposes. For example: pictures tagged #daisy might also include people and characters named Daisy. If you go through your images and weed out any mistakes it can do wonders for your overall accuracy.


Stage 2 - Create CNN Classification Label Sets

We are now able to proceed with creating sets of labels to feed into our label pipeline. We are seeking to strike a balance between settling on an informative enough representation of the class under scrutiny (London building types) while also allowing the selected CNN Architectures to achieve the highest accuracy, precision and recall rates.

Label Set A - Mapping the OS Addressbase Premium Address (BS7666) Schema to OSM Building Types

Our primary data source for describing properties is the Ordnance Survey AddressBase Premium Address dataset. This provides a hierarchical and fine-grained Address Classification schema based on Building Usage. However, for the purposes of a CNN Classification Task we are primarily concerned with finding and adopting a Visual Classification schema, one that can define and differentiate Building Features and Types based on Visual Appearance alone. Whilst the Open Street Map Building Layer Classification schema is also concerned with Use Classes, it additionally provides a descriptive layer feature schema based on Building Types. As a developing open source initiative, much of this data is yet to be populated. We therefore need to adopt criteria and a working methodology for assigning class members to classes of Building Types. In this instance, we adopt a Fuzzy Logic approach, whereby set membership is discerned by applying a measure of the degree of potential set-membership characteristics a given candidate is seen to display.

Address Base Schema:

Classifications are sourced from the Local Land and Property Gazetteers and from Ordnance Survey large-scale data.

Note:

As we might expect, AddressBase classes do not necessarily translate well into Machine-Learner-friendly Classification Labels.

There are over 500 types contained in the schema used by Ordnance Survey. These describe characteristics that go beyond visual appearance and differentiate how a property is used. Have a look at the Schema Descriptions below: they drill down to Tertiary and Quaternary groupings, and provide a continuous and fine-grained description of the property type.

Open Street Map provides a building description schema which, at its most basic, is also concerned with a building's use. However, the values and schema also lend themselves to categorisation by type and, crucially, by visual type.
Example value from the OSM documentation:
Key: building Value: apartments Comment: A building arranged into individual dwellings, often on separate floors. May also have retail outlets on the ground floor.
https://wiki.openstreetmap.org/wiki/Key:building

The first aim of the study is to train a CNN to identify Building Types, so we will focus on mapping AddressBase classes to the appropriate Open Street Map Building Type categories. The latter better discriminate between building objects with clearly delineated, individual visual characteristics, and so should prove more ML-friendly.

Early exploratory CNN Training and Classification runs were tried with different variations of AddressBase Label, but proved consistently inaccurate. Please see notes on the Auxiliary Descriptor runs below.

Open Street Map Schema:

Applying Crisp and Fuzzy Set approaches to Data Filtering:

We will see that the House (Detached and Semi-Detached) label class is both numerous and relatively visually distinct. This will be less so for the troublesome Terraces and Metropolitan High Street scenes.

Our other building categories, especially in an urban setting with strong mixed-use development and urban-density-focused planning policy in play, are not so easy to categorise for the purposes of a CNN classification task.

We will apply two approaches to this problem.

1 - Apply Filters to the Generic Building Class by adopting the development scale definitions used in UK Town Planning, i.e. Major and Minor Developments. This should provide us with candidate members that possess strong characteristics of the feature class (e.g. a Large Block of Modern New Build Flats).

2 - Where these categories are not successful we also apply a Fuzzy Set Membership approach. We select a parameter of choice (e.g. Floor Space, Build Unit Count) that allows for a greater or lesser degree of class membership for the candidate set member. We consider and adjust the Fuzzy Variable Control until we have a satisfactory cut off point and a label that might aid the CNN Building Type Classification Task.
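The Fuzzy Variable Control described above can be sketched as a membership function plus an adjustable alpha cut. The ramp bounds (2 and 10 units) below are illustrative values, not thresholds taken from the study.

```python
import numpy as np

# Minimal sketch of the fuzzy membership idea: rather than a crisp in/out
# filter, score each candidate's degree of membership from a chosen control
# parameter (here unit count), then cut at an adjustable alpha level.
# The ramp bounds (2 and 10 units) are illustrative, not values from the study.
def membership(units, lo=2, hi=10):
    """Linear ramp: 0 below lo, 1 above hi, proportional in between."""
    return float(np.clip((units - lo) / (hi - lo), 0.0, 1.0))

def fuzzy_filter(unit_counts, alpha=0.5):
    """Keep candidates whose membership grade meets the alpha cut."""
    return [u for u in unit_counts if membership(u) >= alpha]

kept = fuzzy_filter([1, 4, 6, 8, 25], alpha=0.5)  # keeps 6, 8 and 25
```

Raising or lowering alpha plays the role of the "Fuzzy Variable Control": it moves the cut off point without changing the underlying membership definition.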

2.1 - Creating House, Flat, Office, Retail and Industrial Property Type Training Label Sets

In [38]:
# Peek at the Prospective Sample Sizes
printmd('<br><br><b>`Primary Class(AddressBase Schema):  CI = Industrial, CO = Office, CR = Retail, R = Residential Dwellings`</b> &#9663;<br><br>')
df_ldd_Filter2 = df_ldd_Filter1[['ClassificationCode','class_label', 'Primary_Desc']]
df_ldd_Filter2 = df_ldd_Filter2[(df_ldd_Filter2['class_label'].str.contains('CI')) |
                                #(df_ldd_Filter2['class_label'].str.contains('RD'))  |
                                (df_ldd_Filter2['class_label'].str.contains('CR')) |                             
                                (df_ldd_Filter2['class_label'].str.contains('CO'))]
df_ldd_Filter2 =  df_ldd_Filter2.groupby('ClassificationCode').filter(lambda x : len(x)>10)
df2 = df_ldd_Filter2.groupby(['class_label', 'ClassificationCode'])['class_label'].count().unstack('ClassificationCode').fillna(0)
#df2.plot(kind='bar', figsize=(18, 10));
fig = df2.plot(kind='bar', figsize=(18, 10)).get_figure()
fig.savefig('Workflow_images/Prop_Mix_1.png')

df_ldd_Filter2 = df_ldd_Filter1[['ClassificationCode','class_label', 'Primary_Desc']]
df_ldd_Filter2 = df_ldd_Filter2[#(df_ldd_Filter2['class_label'].str.contains('CI')) |
                                (df_ldd_Filter2['class_label'].str.contains('RD'))  #|
                                #(df_ldd_Filter2['class_label'].str.contains('CR')) |                             
                                #(df_ldd_Filter2['class_label'].str.contains('CO'))
                                ]
df_ldd_Filter2 =  df_ldd_Filter2.groupby('ClassificationCode').filter(lambda x : len(x)>10)
df2 = df_ldd_Filter2.groupby(['class_label', 'ClassificationCode'])['class_label'].count().unstack('ClassificationCode').fillna(0)
#df2.plot(kind='bar', figsize=(18, 10));
fig = df2.plot(kind='bar', figsize=(18, 10)).get_figure()
fig.savefig('Workflow_images/Prop_Mix_2.png')

#Future Ref:
#Normalisation - https://matplotlib.org/3.1.0/tutorials/colors/colormapnorms.html



`Primary Class(AddressBase Schema): CI = Industrial, CO = Office, CR = Retail, R = Residential Dwellings`

- 8090 House Images
- 16814 Flat Images
- 1412 Office Images
- 2789 Retail Images
- 549 Industrial Images

The Kang et al. study achieved accuracy rates of 70% with a Places365 munged/reduced dataset of n=2500 samples. The TensorFlow documentation states that reasonable success can be achieved with sample sizes as low as n=100. We will take two approaches, focusing on Sample Size Numbers and Building Type Diversity respectively.

2.2 Snapshot of the Pre Processed and Filtered Data Set

In [39]:
df_cat_house = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('RD02')) | 
                              (df_ldd_Filter1['ClassificationCode'].str.startswith('RD03'))
                              | (df_ldd_Filter1['ClassificationCode'].str.startswith('RD04'))  ]

df_cat_house.shape

df_cat_flats = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('RD06'))]
df_cat_flats.shape

df_cat_commercial_office = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('CO'))]
df_cat_commercial_office.shape

df_cat_commercial_retail = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('CR'))]
df_cat_commercial_retail.shape

df_cat_indust = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('CI'))]
df_cat_indust.shape
Out[39]:
(8090, 43)
Out[39]:
(16814, 43)
Out[39]:
(1412, 43)
Out[39]:
(2789, 43)
Out[39]:
(549, 43)

2.3 Inspect Use Class Distribution

Explanation of the adopted 200 and 1000 Sample Size approach:

Note on our Label Variations/Trade Offs:


  • Large Numbers
  • Granular Class Detail
  • Equal Distribution between Label Sets
  • Therefore create labels of two sizes, dependent on the minimum sample size available (for the Borough selection)

  • The London Wide Selection will give us greater Training Set numbers, but will offer less Test Set Data on which to apply and measure the CNN accuracy findings.

  • We can always try a lower mix and apply distortions, e.g. sampling with replace=True in the NumPy/pandas sampler.
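The replace=True idea in the last bullet can be sketched as follows: when a class has fewer rows than the target sample size, sample with replacement to pad it out (accepting duplicate images) rather than shrinking every other class. df_small below is a stand-in for an under-represented class such as Industrial.

```python
import pandas as pd

# Sketch of oversampling an under-represented class with replacement.
# df_small stands in for a class with only 50 rows (e.g. Industrial).
df_small = pd.DataFrame({"TempID": range(50)})

target = 200
# replace=True allows rows to be drawn more than once, so n can exceed len(df)
oversampled = df_small.sample(n=target, replace=True, random_state=1)
```

The cost is duplicate training images, so any distortions/augmentation applied downstream become more important for variety.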



Test/Train Split and Cross Validation in Tensorflow:

One of the things the (TensorFlow Hub) script does under the hood when you point it at a folder of images is divide them up into three different sets. The usual split is to put 80% of the images into the main training set, keep 10% aside to run as validation frequently during training, and then have a final 10% that are used less often as a testing set to predict the real-world performance of the classifier. These ratios can be controlled using the --testing_percentage and --validation_percentage flags. In general you should be able to leave these values at their defaults, since you won't usually find any advantage from adjusting them.

In [40]:
# Just get a list of labels for inspecting AddressBase building schema types?
#labels = df_ldd_Filter1['ClassificationCode'].value_counts().reset_index(name="count").query("count > 50")
#labels = df_ldd_Filter1['class_label'].unique()

printmd('<br><br><b>`Number and Type of Properties Built Since 2004`</b> &#9663;<br><br>')
plotdat_label(df_ldd_Filter1,'ClassificationCode')
printmd("<b>`Selected Boroughs:" + str(Borough_Selection) + '`</b><br><br><br><br>')



`Number and Type of Properties Built Since 2004`

`Selected Boroughs:('Bexley', 'Barking and Dagenham', 'Barnet', 'Brent', 'Bromley', 'Camden', 'Croydon', 'City of London', 'Enfield', 'Ealing', 'Greenwich', 'Hackney', 'Haringey', 'Hammersmith and Fulham', 'Harrow', 'Havering', 'Hounslow', 'Hillingdon', 'Kensington and Chelsea', 'Islington', 'Kingston upon Thames', 'Lewisham', 'Lambeth', 'Redbridge', 'Newham', 'Merton', 'London Legacy DC', 'Southwark', 'Sutton', 'Richmond upon Thames', 'Waltham Forest', 'Wandsworth', 'Tower Hamlets', 'Westminster')`



2.4 - Create House Property Type Label:

In [45]:
# We could remove terraces and semis to avoid confusion with mixed use high street
# typologies (e.g. buildings & streets with shops below and residential or office above)

# HOUSE 
# House Sample Count is Low (we are in London) so no need to standardize/equalize sample size
# Filter on All Address Base House Types:
df_cat_house = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('RD02')) | 
                              (df_ldd_Filter1['ClassificationCode'].str.startswith('RD03'))
                              | (df_ldd_Filter1['ClassificationCode'].str.startswith('RD04'))  ]

df_cat_house.shape
df_cat_house.describe()
Out[45]:
(8090, 43)
Out[45]:
TempID Resi_Site_Prop_ha Non_Resi_Site_Prop_ha Total_Open_ Space_Exist_ha Total_Open_ Space_Prop_ha Total_Site_Area_Prop_ha Existing_Total_Residential_Units Proposed_Total_Residential_Units Proposed_TotalAffordable_Units Proposed_Total_Affordable_Percentage Proposed_Residential_Parking_Spaces Existing_Total_Bedrooms Proposed_Total_Bedrooms Existing_Total_Floorspace Proposed_Total_Floorspace Latitude Longitude Tertiary_Code Unnamed: 0 Large User Deleted Acorn Category Acorn Type
count 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 5.446000e+03 5446.000000 5446.000000 5371.000000 5371.000000
mean 32516.166749 0.062631 0.004379 0.005513 0.001171 0.068181 0.871199 3.417676 0.861187 5.335476 3.031891 0.224104 0.172064 78.168480 25.364030 51.496133 -0.117451 3.116316 1.363628e+06 0.012119 0.013772 2.517036 23.232545
std 19303.975401 0.216708 0.087213 0.126687 0.060534 0.260717 6.556296 16.761604 8.159592 21.970297 15.430509 2.633943 2.320753 825.568495 968.783741 0.080279 0.175467 0.866653 7.220317e+05 0.109427 0.116552 1.345466 15.879226
min 4.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 51.293497 -0.498269 2.000000 2.374630e+05 0.000000 0.000000 1.000000 1.000000
25% 14920.750000 0.014000 0.000000 0.000000 0.000000 0.015000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 51.432644 -0.237186 2.000000 7.164422e+05 0.000000 0.000000 1.000000 11.000000
50% 32965.500000 0.026000 0.000000 0.000000 0.000000 0.027000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 51.499794 -0.143002 3.000000 1.403481e+06 0.000000 0.000000 2.000000 20.000000
75% 49363.500000 0.056000 0.000000 0.000000 0.000000 0.058000 1.000000 2.000000 0.000000 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 51.564126 0.020858 4.000000 2.039091e+06 0.000000 0.000000 3.000000 29.000000
max 65871.000000 7.744000 5.250000 7.012000 5.300000 8.997000 420.000000 587.000000 468.000000 100.000000 662.000000 96.000000 83.000000 39778.000000 85000.000000 51.677610 0.276986 4.000000 2.386637e+06 1.000000 1.000000 6.000000 62.000000
In [46]:
#Get Samples
#House Sample Size > 1000
#The Other AB Classes will require code for Setting Max and Mins
df_cat_house_1000 = df_cat_house.sample(n=1000, random_state=1)
df_cat_house_200 = df_cat_house.sample(n=200, random_state=1)

#df_cat_house = df_cat_house.sample(n=200, random_state=1)
df_cat_house.shape

#Test Train Split
# 1000s Image Sample Label Set
# GReater Sample Size so can afford Large Test Set for the Classify and Map Visualisation Stages
msk = np.random.rand(len(df_cat_house_1000)) < 0.80
df_cat_house_1000_train = df_cat_house_1000[msk]   
df_cat_house_1000_test = df_cat_house_1000[~msk]
df_cat_house_1000_train.shape
df_cat_house_1000_test.shape

# 100s Image Sample Label Set
msk = np.random.rand(len(df_cat_house_200)) < 0.9
df_cat_house_200_train = df_cat_house_200[msk]   
df_cat_house_200_test = df_cat_house_200[~msk]
df_cat_house_200_train.shape
df_cat_house_200_test.shape
Out[46]:
(8090, 43)
Out[46]:
(792, 43)
Out[46]:
(208, 43)
Out[46]:
(181, 43)
Out[46]:
(19, 43)
In [47]:
df_cat_house = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('RD02')) | 
                              (df_ldd_Filter1['ClassificationCode'].str.startswith('RD03'))
                              | (df_ldd_Filter1['ClassificationCode'].str.startswith('RD04'))  ]

df_cat_house.shape
df_cat_house.describe()
Out[47]:
(8090, 43)
Out[47]:
TempID Resi_Site_Prop_ha Non_Resi_Site_Prop_ha Total_Open_ Space_Exist_ha Total_Open_ Space_Prop_ha Total_Site_Area_Prop_ha Existing_Total_Residential_Units Proposed_Total_Residential_Units Proposed_TotalAffordable_Units Proposed_Total_Affordable_Percentage Proposed_Residential_Parking_Spaces Existing_Total_Bedrooms Proposed_Total_Bedrooms Existing_Total_Floorspace Proposed_Total_Floorspace Latitude Longitude Tertiary_Code Unnamed: 0 Large User Deleted Acorn Category Acorn Type
count 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 5.446000e+03 5446.000000 5446.000000 5371.000000 5371.000000
mean 32516.166749 0.062631 0.004379 0.005513 0.001171 0.068181 0.871199 3.417676 0.861187 5.335476 3.031891 0.224104 0.172064 78.168480 25.364030 51.496133 -0.117451 3.116316 1.363628e+06 0.012119 0.013772 2.517036 23.232545
std 19303.975401 0.216708 0.087213 0.126687 0.060534 0.260717 6.556296 16.761604 8.159592 21.970297 15.430509 2.633943 2.320753 825.568495 968.783741 0.080279 0.175467 0.866653 7.220317e+05 0.109427 0.116552 1.345466 15.879226
min 4.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 51.293497 -0.498269 2.000000 2.374630e+05 0.000000 0.000000 1.000000 1.000000
25% 14920.750000 0.014000 0.000000 0.000000 0.000000 0.015000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 51.432644 -0.237186 2.000000 7.164422e+05 0.000000 0.000000 1.000000 11.000000
50% 32965.500000 0.026000 0.000000 0.000000 0.000000 0.027000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 51.499794 -0.143002 3.000000 1.403481e+06 0.000000 0.000000 2.000000 20.000000
75% 49363.500000 0.056000 0.000000 0.000000 0.000000 0.058000 1.000000 2.000000 0.000000 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 51.564126 0.020858 4.000000 2.039091e+06 0.000000 0.000000 3.000000 29.000000
max 65871.000000 7.744000 5.250000 7.012000 5.300000 8.997000 420.000000 587.000000 468.000000 100.000000 662.000000 96.000000 83.000000 39778.000000 85000.000000 51.677610 0.276986 4.000000 2.386637e+06 1.000000 1.000000 6.000000 62.000000
In [48]:
printmd('<br><br><b>`Number of New Build House by AB Primary Class`</b> ' + ' &#9663;<br><br>')
df_proptype = df_cat_house.groupby(['Planning_A', 'Tertiary_Desc'])['Planning_A'].count().unstack('Tertiary_Desc').fillna(0)
df_proptype.plot(kind='bar', stacked=True, figsize=(18, 10));
#printmd("<b>`Selected Boroughs: London Wide`</b><br><br><br><br>")
#df2.plot(kind='bar', figsize=(18, 10));



`Number of New Build House by AB Primary Class`

Note


Unsurprisingly we have considerably fewer houses in the predominantly inner urban borough of Tower Hamlets. This shouldn't skew the neutrality of the CNN Classification (we are attempting a label classification of London Wide Houses rather than houses classified by Borough - although see the earlier notes on auxiliary runs for consideration of this approach). We will need to ensure we select an equal amount of training labels from each borough.

The count also seems quite low given the data covers over 10 years - although new houses are rare in the city; it is almost always flats (see the London Plan and making best use of land etc.).

Inspect Image Data for House Building Types
In [ ]:
thumbs("df_cat_house", "House Building Types", 300, 2)

2.5 - Create Flat Property Type Label:

In [50]:
#FLATS
#Get AB Code RD06
df_cat_flats = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('RD06'))]
df_cat_flats.shape
#df_cat_flats.describe()
Out[50]:
(16814, 43)
In [51]:
#Remove Flat Conversions, which the CNN easily confuses with the House Building Type(using ampersand)
df_cat_flats_no_conv = df_cat_flats[(~df_cat_flats['Developmen'].str.contains('Conversion')) 
                                  & (~df_cat_flats['Developmen'].str.contains('Change of use'))
                                  & (~df_cat_flats['Developmen'].str.contains('change of use'))
                                  & (~df_cat_flats['Developmen'].str.contains('conversion'))]


#df_cat_flats_no_conv = df_cat_flats_no_conv.sample(n=1000, random_state=1)
#df_cat_flats_no_conv = df_cat_flats_no_conv.sample(n=200, random_state=1)
                                                                   
#Contains Flat Conversions (terms combined with OR - a row matching any one
#conversion/change-of-use phrase counts as a conversion)
df_cat_flats_conv = df_cat_flats[(df_cat_flats['Developmen'].str.contains('Conversion')) 
                                  | (df_cat_flats['Developmen'].str.contains('Change of use'))
                                  | (df_cat_flats['Developmen'].str.contains('change of use'))
                                  | (df_cat_flats['Developmen'].str.contains('conversion'))]
                                    

#As mentioned earlier, we create  label sets with 1000 and 200 Sample Sizes to match min and max sizes 
# of the full set of labels(House, Flat, Industrial, Retail and Office) we are attempting to Classify
df_cat_flats_1000 = df_cat_flats.sample(n=1000, random_state=1)
df_cat_flats_200 = df_cat_flats.sample(n=200, random_state=1)

#Get and Set Max Sample Size 
max_flats = max_major_flats = max_minor_flats = max_fuzzy_flats = 1000
# Borough Selection yields less than 1000 Records, so get as many as possible
# Use independent if checks - each cap must track its own filter
if len(df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] > 10]) < 1000:
    max_major_flats = len(df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] > 10])
if len(df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] < 10]) < 1000:
    max_minor_flats = len(df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] < 10])
if len(df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] > 5]) < 1000:
    max_fuzzy_flats = len(df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] > 5])
    

#Use LDD Criteria to Create Major Developmnet Flat Type Class
df_cat_flats_major_1000 = df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] > 10].sample(n= max_major_flats, random_state=1)
df_cat_flats_major_200 = df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] > 10].sample(n=200, random_state=1)


#Use LDD Crieteria to Create Minor Developmnet Flat Type Class
df_cat_flats_minor_1000 = df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] < 10].sample(n=max_minor_flats, random_state=1)
df_cat_flats_minor_200 = df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] < 10].sample(n=200, random_state=1)

                                 
#Fuzzy Filter: > 5 Units Removes Retail Terraces Building Types
df_cat_flats_fuzzy_units_1000 = df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] > 5].sample(n=max_fuzzy_flats, random_state=1)
df_cat_flats_fuzzy_units_200 = df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] > 5].sample(n=200, random_state=1)
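The max-size bookkeeping in the cell above can be collapsed into a small helper that samples up to n rows, or everything available when a filter yields fewer. This is a sketch of the pattern, not code from the study; df_demo and its column values are illustrative.

```python
import pandas as pd

# Helper: sample at most n rows; when the filtered frame has fewer rows
# than n, take everything, avoiding a per-class if/elif cap ladder.
def capped_sample(df, n, random_state=1):
    return df.sample(n=min(n, len(df)), random_state=random_state)

# Illustrative frame: 30 developments with unit counts 0..29
df_demo = pd.DataFrame({"Proposed_Total_Residential_Units": range(30)})

# Only 19 rows exceed 10 units, so a request for 1000 returns all 19
major = capped_sample(df_demo[df_demo["Proposed_Total_Residential_Units"] > 10], 1000)
```

The same helper works for the office and retail cells further down, where the per-borough selections can also fall short of the 1000-row target.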
In [52]:
# Split into Test and Train Sets:
# As mentioned earlier, the Tensorflow CNN Function will split data into Train, Validation & Test Sets
# so we need only a small amount for our Final Test Set

#Generic Type
msk = np.random.rand(len(df_cat_flats_1000)) < 0.8
df_cat_flats_1000_train = df_cat_flats_1000[msk]   
df_cat_flats_1000_test = df_cat_flats_1000[~msk]

msk = np.random.rand(len(df_cat_flats_200)) < 0.8
df_cat_flats_200_train = df_cat_flats_200[msk]   
df_cat_flats_200_test = df_cat_flats_200[~msk]

#Major Dev Flat Type - 1000 Sample Size
msk = np.random.rand(len(df_cat_flats_major_1000)) < 0.9
df_cat_flats_major_1000_train = df_cat_flats_major_1000[msk]   
df_cat_flats_major_1000_test = df_cat_flats_major_1000[~msk]

#Major Dev Flat Type - 200 Sample Size
msk = np.random.rand(len(df_cat_flats_major_200)) < 0.9
df_cat_flats_major_200_train = df_cat_flats_major_200[msk]   
df_cat_flats_major_200_test = df_cat_flats_major_200[~msk]

#Minor Dev Flat Type 
msk = np.random.rand(len(df_cat_flats_minor_1000)) < 0.8
df_cat_flats_minor_1000_train = df_cat_flats_minor_1000[msk]   
df_cat_flats_minor_1000_test = df_cat_flats_minor_1000[~msk]

msk = np.random.rand(len(df_cat_flats_minor_200)) < 0.8
df_cat_flats_minor_200_train = df_cat_flats_minor_200[msk]   
df_cat_flats_minor_200_test = df_cat_flats_minor_200[~msk]

#Fuzzy Dev Flat Type 
msk = np.random.rand(len(df_cat_flats_fuzzy_units_200)) < 0.9
df_cat_flats_fuzzy_units_200_train = df_cat_flats_fuzzy_units_200[msk]   
df_cat_flats_fuzzy_units_200_test = df_cat_flats_fuzzy_units_200[~msk]

msk = np.random.rand(len(df_cat_flats_fuzzy_units_1000)) < 0.9
df_cat_flats_fuzzy_units_1000_train = df_cat_flats_fuzzy_units_1000[msk]   
df_cat_flats_fuzzy_units_1000_test = df_cat_flats_fuzzy_units_1000[~msk]
In [53]:
printmd('<br><br><b>`Number of Flat Developments per CNN Classifier Label Mix by Selected Boroughs`</b> &#9663;<br><br>')
fig=plt.figure(figsize=(15,8))
plotdat_group(df_cat_flats_fuzzy_units_1000,'Planning_A', 221, 'Fuzzy Flats')
plotdat_group(df_cat_flats_1000,'Planning_A', 222, 'Flats')
plotdat_group(df_cat_flats_major_1000,'Planning_A', 223, 'Major Development Flat Type')
plotdat_group(df_cat_flats_minor_1000,'Planning_A', 224, 'Minor Development Flat Type')
plt.tight_layout()
plt.show()



`Number of Flat Developments per CNN Classifier Label Mix by Selected Boroughs`

As can be seen from the above, we now have 3 Variations of Flat Class Definition with which to fine tune and determine the greatest recall accuracy with the Inception CNN Model.

The latter two categories focus on the scale of development, as outlined, for example, in The Town and Country Planning (Development Management Procedure) (England) Order 2010. Criteria for Developments to be considered in the Major Category include 10+ dwellings, a site over half a hectare, or building(s) exceeding 1000m². Minor Developments are therefore 1-9 dwellings (unless the floorspace exceeds 1000m² or the site exceeds half a hectare). Ultimately we are seeking to differentiate between Large Blocks of Flats and smaller Building Development Object Types.
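Those scale criteria can be sketched as a small classifier. The thresholds follow the description above (10+ dwellings, over half a hectare, floorspace exceeding 1000m²); the function itself is an illustrative sketch, not a full reading of the Order.

```python
# Sketch of the development-scale criteria described above: a scheme is
# Major if it meets any one threshold, otherwise Minor.
# Thresholds are as paraphrased in the text, not the Order's full wording.
def development_scale(dwellings=0, site_area_ha=0.0, floorspace_m2=0.0):
    if dwellings >= 10 or site_area_ha > 0.5 or floorspace_m2 > 1000:
        return "major"
    return "minor"

scale = development_scale(dwellings=12)  # 12 dwellings -> "major"
```

Applied to the LDD fields, dwellings maps naturally onto Proposed_Total_Residential_Units, which is the control used for the Major/Minor flat label sets above.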

Note that we have no properties in the Flat_Conversion Category, which accords with our study's focus on New Build Development. These have been excluded as part of the Data Model Build Stage. Please note that our rather clumsy pattern matching may inadvertently also be excluding conversions that relate not just to a single flat, e.g. conversion of offices from B1a to C3 residential use (see the Town and Country Planning (General Permitted Development) (England) Order 2015).
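One way to tidy the pattern matching is a single case-insensitive regex in place of the four separate str.contains() calls, so "Conversion"/"conversion" and "Change of use"/"change of use" are caught in one pass. The example frame below is illustrative.

```python
import pandas as pd

# Sketch: one case-insensitive pattern replaces four str.contains() calls.
# The example descriptions are illustrative, not rows from the LDD.
df = pd.DataFrame({"Developmen": [
    "New build block of flats",
    "Conversion of house to 3 flats",
    "change of use from B1a office to C3 residential",
]})

pattern = r"conversion|change of use"
is_conv = df["Developmen"].str.contains(pattern, case=False, regex=True)
df_no_conv = df[~is_conv]  # keeps only the new-build row
```

This does not resolve the over-exclusion issue noted above (office-to-residential schemes still match "change of use"), but it makes the rule easier to read and adjust.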

Our Set Controls in this category are provided by text from the proposal field and Residential Unit Counts, both provided by the LDD.

Inspect Image Data:
In [ ]:
thumbs("df_cat_flats_major_1000", "Major Development", 100, 2)
thumbs("df_cat_flats_minor_1000", "Minor Development", 100, 2)
thumbs("df_cat_flats_fuzzy_units_1000", "Fuzzy Set Control", 100, 2)
thumbs("df_cat_flats_no_conv", "No Conversion", 100, 2)

2.6 - Create Office Property Type Label:

In [55]:
#OFFICES
#Get AB Codes CO%
df_cat_commercial_office = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('CO'))]
df_cat_commercial_office.shape
#df_cat_commercial_office.describe()
Out[55]:
(1412, 43)
In [56]:
# Create Class Membership Control 
#df_cat_commercial_office['FloorSpace'].describe()
df_cat_commercial_office['FloorSpace'] = df_cat_commercial_office['Existing_Total_Floorspace'] + df_cat_commercial_office['Proposed_Total_Floorspace']

max_office = max_major_office = max_minor_office = 1000

#Get and Set Max Sample Size 
if len(df_cat_commercial_office[df_cat_commercial_office['FloorSpace'] > 100]) < 1000:
    max_major_office = len(df_cat_commercial_office[df_cat_commercial_office['FloorSpace'] > 100]) 
if len(df_cat_commercial_office[df_cat_commercial_office['FloorSpace'] < 100]) < 1000:
    max_minor_office = len(df_cat_commercial_office[df_cat_commercial_office['FloorSpace'] < 100])
if len(df_cat_commercial_office) < 1000:
    max_office = len(df_cat_commercial_office)
    
df_cat_commercial_office_1000 = df_cat_commercial_office.sample(n=max_office, random_state=1)
df_cat_commercial_office_200 = df_cat_commercial_office.sample(n=200, random_state=1)

df_cat_commercial_office_major_1000 = df_cat_commercial_office[df_cat_commercial_office['FloorSpace'] > 100].sample(n=max_major_office, random_state=1)
df_cat_commercial_office_major_200 = df_cat_commercial_office[df_cat_commercial_office['FloorSpace'] > 100].sample(n=200, random_state=1)

df_cat_commercial_office_minor_1000 = df_cat_commercial_office[df_cat_commercial_office['FloorSpace'] < 100].sample(n=max_minor_office, random_state=1)
df_cat_commercial_office_minor_200 = df_cat_commercial_office[df_cat_commercial_office['FloorSpace'] < 100].sample(n=min(200, max_minor_office), random_state=1)
/Users/anthonysutton/ml2/env/lib/python3.6/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
In [57]:
#Testing and Troubleshooting
#df_cat_commercial_office[df_cat_commercial_office['TempID'] == 2549]
In [58]:
#Test Train Split
#Generic Office Types
#Lower Sample Size so bump up the Train Set
msk = np.random.rand(len(df_cat_commercial_office_200)) < 0.95
df_cat_commercial_office_200_train = df_cat_commercial_office_200[msk]   
df_cat_commercial_office_200_test = df_cat_commercial_office_200[~msk]

msk = np.random.rand(len(df_cat_commercial_office_1000)) < 0.8
df_cat_commercial_office_1000_train = df_cat_commercial_office_1000[msk]   
df_cat_commercial_office_1000_test = df_cat_commercial_office_1000[~msk]

# Minor Office Development Types
msk = np.random.rand(len(df_cat_commercial_office_minor_1000)) < 0.8
df_cat_commercial_office_minor_1000_train = df_cat_commercial_office_minor_1000[msk]   
df_cat_commercial_office_minor_1000_test = df_cat_commercial_office_minor_1000[~msk]

msk = np.random.rand(len(df_cat_commercial_office_minor_200)) < 0.95
df_cat_commercial_office_minor_200_train = df_cat_commercial_office_minor_200[msk]   
df_cat_commercial_office_minor_200_test = df_cat_commercial_office_minor_200[~msk]

# Major Office Development Types
msk = np.random.rand(len(df_cat_commercial_office_major_1000)) < 0.8
df_cat_commercial_office_major_1000_train = df_cat_commercial_office_major_1000[msk]   
df_cat_commercial_office_major_1000_test = df_cat_commercial_office_major_1000[~msk]

msk = np.random.rand(len(df_cat_commercial_office_major_200)) < 0.95
df_cat_commercial_office_major_200_train = df_cat_commercial_office_major_200[msk]   
df_cat_commercial_office_major_200_test = df_cat_commercial_office_major_200[~msk]
In [59]:
printmd('<br><br><b>`Number of Office Developments per CNN Classifier Label Mix by Selected Boroughs`</b> &#9663;<br><br>')
fig=plt.figure(figsize=(15,12))
plotdat_group(df_cat_commercial_office_1000_train,'Planning_A', 221, 'Office')
plotdat_group(df_cat_commercial_office_200_train,'Planning_A', 222, 'London Wide - Office')
plotdat_group(df_cat_commercial_office_major_1000_train,'Planning_A', 223, 'Major Development Office Type')
plotdat_group(df_cat_commercial_office_minor_1000_train,'Planning_A', 224, 'Minor Development Office Type')
plt.tight_layout()
plt.show()



`Number of Office Developments per CNN Classifier Label Mix by Selected Boroughs`

Inspect Image Data:
In [ ]:
thumbs('df_cat_commercial_office_minor_1000_train', 'Minor Office Only', 100, 2)
thumbs('df_cat_commercial_office_major_1000_train', 'Major Office Only', 100, 2)
thumbs('df_cat_commercial_office_1000_train', 'Office - All', 100, 2)

2.7 - Create Retail Property Type Label:

In [61]:
# RETAIL
#Get AB Codes CR%
df_cat_commercial_retail = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('CR'))]
df_cat_commercial_retail.shape
df_cat_commercial_retail.describe()
Out[61]:
(2789, 43)
Out[61]:
TempID Resi_Site_Prop_ha Non_Resi_Site_Prop_ha Total_Open_ Space_Exist_ha Total_Open_ Space_Prop_ha Total_Site_Area_Prop_ha Existing_Total_Residential_Units Proposed_Total_Residential_Units Proposed_TotalAffordable_Units Proposed_Total_Affordable_Percentage Proposed_Residential_Parking_Spaces Existing_Total_Bedrooms Proposed_Total_Bedrooms Existing_Total_Floorspace Proposed_Total_Floorspace Latitude Longitude Tertiary_Code Unnamed: 0 Large User Deleted Acorn Category Acorn Type
count 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2509.000000 1.988000e+03 1988.000000 1988.000000 1918.000000 1918.000000
mean 34844.882754 0.027782 0.035648 0.001007 0.000824 0.064253 0.522768 6.490140 1.372535 2.975977 2.189674 0.641449 2.394765 526.823234 615.406956 51.504571 -0.118155 7.487047 1.375373e+06 0.037726 0.035211 2.924400 27.225235
std 19495.014154 0.113927 0.202654 0.022313 0.019074 0.243187 2.206015 29.727076 11.001792 15.478677 12.837393 18.057332 23.489690 3297.855759 4559.168684 0.069200 0.127908 1.585524 6.814646e+05 0.190582 0.184360 1.326673 14.845662
min 47.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 51.309772 -0.482140 1.000000 2.373730e+05 0.000000 0.000000 1.000000 2.000000
25% 17921.000000 0.004000 0.000000 0.000000 0.000000 0.006000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 51.460425 -0.198795 8.000000 7.164508e+05 0.000000 0.000000 2.000000 16.000000
50% 36146.000000 0.009000 0.000000 0.000000 0.000000 0.012000 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 40.000000 0.000000 51.513693 -0.110480 8.000000 1.403419e+06 0.000000 0.000000 2.000000 20.000000
75% 52506.000000 0.017000 0.002000 0.000000 0.000000 0.028000 1.000000 4.000000 0.000000 0.000000 0.000000 0.000000 0.000000 170.000000 0.000000 51.553263 -0.046071 8.000000 2.039394e+06 0.000000 0.000000 4.000000 36.000000
max 66214.000000 3.300000 4.870000 0.747000 0.710000 4.870000 67.000000 734.000000 297.000000 100.000000 325.000000 909.000000 602.000000 82290.000000 117739.000000 51.665857 0.258354 11.000000 2.387253e+06 1.000000 1.000000 6.000000 62.000000
In [62]:
# Create Class Membership Control 
df_cat_commercial_retail['FloorSpace'] = df_cat_commercial_retail['Existing_Total_Floorspace'] + df_cat_commercial_retail['Proposed_Total_Floorspace']
#df_cat_commercial_retail['FloorSpace'].describe()


max_retail = max_major_retail = max_minor_retail = 1000

#Get and Set Max Sample Size 
#Borough Selection yields less than 1000 Records, so get as many as possible
if len(df_cat_commercial_retail[df_cat_commercial_retail['FloorSpace'] > 1000]) < 1000:
    max_major_retail = len(df_cat_commercial_retail[df_cat_commercial_retail['FloorSpace'] > 1000]) 
if len(df_cat_commercial_retail[df_cat_commercial_retail['FloorSpace'] < 1000]) < 1000:
    max_minor_retail = len(df_cat_commercial_retail[df_cat_commercial_retail['FloorSpace'] < 1000]) 
if len(df_cat_commercial_retail) < 1000:
    max_retail = len(df_cat_commercial_retail) 


#Generic Retail
df_cat_commercial_200_retail = df_cat_commercial_retail.sample(n=200, random_state=1)
df_cat_commercial_1000_retail = df_cat_commercial_retail.sample(n=1000, random_state=1)

#Major Retail
df_cat_commercial_retail_1000_major = df_cat_commercial_retail[df_cat_commercial_retail['FloorSpace'] > 1000].sample(n=max_major_retail, random_state=1)
df_cat_commercial_retail_200_major = df_cat_commercial_retail[df_cat_commercial_retail['FloorSpace'] > 1000].sample(n=min(200, max_major_retail), random_state=1)

#Minor Retail
df_cat_commercial_retail_200_minor = df_cat_commercial_retail[df_cat_commercial_retail['FloorSpace'] < 1000].sample(n=min(200, max_minor_retail), random_state=1)
df_cat_commercial_retail_1000_minor = df_cat_commercial_retail[df_cat_commercial_retail['FloorSpace'] < 1000].sample(n=max_minor_retail, random_state=1)
/Users/anthonysutton/ml2/env/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
In [63]:
#Test Train Split
#Major Retail
msk = np.random.rand(len(df_cat_commercial_retail_200_major)) < 0.95
df_cat_commercial_retail_200_major_train = df_cat_commercial_retail_200_major[msk]   
df_cat_commercial_retail_200_major_test = df_cat_commercial_retail_200_major[~msk]

msk = np.random.rand(len(df_cat_commercial_retail_1000_major)) < 0.8
df_cat_commercial_retail_1000_major_train = df_cat_commercial_retail_1000_major[msk]   
df_cat_commercial_retail_1000_major_test = df_cat_commercial_retail_1000_major[~msk]

#Minor Retail
msk = np.random.rand(len(df_cat_commercial_retail_1000_minor)) < 0.8
df_cat_commercial_retail_1000_minor_train = df_cat_commercial_retail_1000_minor[msk]   
df_cat_commercial_retail_1000_minor_test = df_cat_commercial_retail_1000_minor[~msk]

msk = np.random.rand(len(df_cat_commercial_retail_200_minor)) < 0.95
df_cat_commercial_retail_200_minor_train = df_cat_commercial_retail_200_minor[msk]   
df_cat_commercial_retail_200_minor_test = df_cat_commercial_retail_200_minor[~msk]

#Generic Retail
msk = np.random.rand(len(df_cat_commercial_200_retail)) < 0.95
df_cat_commercial_retail_200_train = df_cat_commercial_200_retail[msk]   
df_cat_commercial_retail_200_test = df_cat_commercial_200_retail[~msk]

msk = np.random.rand(len(df_cat_commercial_1000_retail)) < 0.8
df_cat_commercial_retail_1000_train = df_cat_commercial_1000_retail[msk]   
df_cat_commercial_retail_1000_test = df_cat_commercial_1000_retail[~msk]
In [64]:
printmd('<br><br><b>`Number of New Build By Property Type in ' + str(Borough_Selection) + '`</b> &#9663;<br><br>')
df_proptype = df_cat_commercial_retail.groupby(['Planning_A', 'ClassificationCode'])['Planning_A'].count().unstack('ClassificationCode').fillna(0)
df_proptype.plot(kind='bar', stacked=True, figsize=(18, 10));

printmd('<br><br><b>`Number of Retail Developments per CNN Classifier Label Mix by Selected Boroughs`</b> &#9663;<br><br>')
fig=plt.figure(figsize=(15,12))
plotdat_group_null(df_cat_commercial_retail_1000_train,'Planning_A', 221, 'Retail')
plotdat_group_null(df_cat_commercial_retail_1000_train,'Planning_A', 222, 'Retail - London Wide')
plotdat_group_null(df_cat_commercial_retail_1000_major_train,'Planning_A', 223, 'Retail Development Major Type')
plotdat_group_null(df_cat_commercial_retail_1000_minor_train,'Planning_A', 224, 'Retail Development Minor Type')
plt.tight_layout()
plt.show()



`Number of New Build By Property Type in ('Bexley', 'Barking and Dagenham', 'Barnet', 'Brent', 'Bromley', 'Camden', 'Croydon', 'City of London', 'Enfield', 'Ealing', 'Greenwich', 'Hackney', 'Haringey', 'Hammersmith and Fulham', 'Harrow', 'Havering', 'Hounslow', 'Hillingdon', 'Kensington and Chelsea', 'Islington', 'Kingston upon Thames', 'Lewisham', 'Lambeth', 'Redbridge', 'Newham', 'Merton', 'London Legacy DC', 'Southwark', 'Sutton', 'Richmond upon Thames', 'Waltham Forest', 'Wandsworth', 'Tower Hamlets', 'Westminster')`

Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a2b3be0>



`Number of Retail Developments per CNN Classifier Label Mix by Selected Boroughs`

Inspect Image Data:
In [ ]:
thumbs("df_cat_commercial_retail_1000_minor_train", "Minor Retail", 100, 2)
thumbs("df_cat_commercial_retail_1000_major_train", "Major Retail", 100, 2)
thumbs("df_cat_commercial_retail_1000_train", "Generic Retail", 100, 2)

2.8 - Create Industrial Property Type Label:

In [66]:
# INDUSTRIAL
#Get AB Code 'CI'
df_cat_indust = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('CI'))]
df_cat_indust.shape
#df_cat_indust.describe() 
Out[66]:
(549, 43)
In [67]:
#Create Class Membership Control 
df_cat_indust['FloorSpace'] = df_cat_indust['Existing_Total_Floorspace'] + df_cat_indust['Proposed_Total_Floorspace']


#df_cat_indust_minor.shape

#df_cat_indust_major =df_cat_indust[df_cat_indust['FloorSpace'] > 1000]
#df_cat_indust_major.shape

#df_cat_indust['FloorSpace'].describe() 


max_ind = max_major_ind = max_minor_ind = 1000

# Borough Selection may yield fewer than 1000 records, so take as many as possible
if len(df_cat_indust[df_cat_indust['FloorSpace'] > 1000]) < 1000:
    max_major_ind = len(df_cat_indust[df_cat_indust['FloorSpace'] > 1000]) 
if len(df_cat_indust[df_cat_indust['FloorSpace'] < 1000]) < 1000:
    max_minor_ind = len(df_cat_indust[df_cat_indust['FloorSpace'] < 1000]) 
if len(df_cat_indust) < 1000:
    max_ind = len(df_cat_indust) 

df_cat_indust_1000 = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('CI'))].sample(n=max_ind, random_state=1)
df_cat_indust_200 = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('CI'))].sample(n=200, random_state=1)

df_cat_indust_minor_1000 =df_cat_indust[df_cat_indust['FloorSpace'] < 1000].sample(n=max_minor_ind, random_state=1)
df_cat_indust_minor_200 = df_cat_indust[df_cat_indust['FloorSpace'] < 1000].sample(n=min(200, max_minor_ind), random_state=1)

df_cat_indust_major_1000 =df_cat_indust[df_cat_indust['FloorSpace'] > 1000].sample(n=max_major_ind, random_state=1)
df_cat_indust_major_200 = df_cat_indust[df_cat_indust['FloorSpace'] > 1000].sample(n=min(200, max_major_ind), random_state=1)
/Users/anthonysutton/ml2/env/lib/python3.6/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
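The guard cells above cap each sample size at whatever the borough selection actually yields. The same idea can be expressed once in a small helper (the name `capped_sample` is hypothetical, not used in this notebook):

```python
import pandas as pd

def capped_sample(df, n, random_state=1):
    """Sample n rows, or every row when fewer than n are available."""
    return df.sample(n=min(n, len(df)), random_state=random_state)

toy = pd.DataFrame({'FloorSpace': [50, 200, 1500, 3000, 800]})
small = capped_sample(toy, 10)   # only 5 rows exist, so all 5 come back
exact = capped_sample(toy, 3)
```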
In [68]:
#Test Train Split
#Low Count so bump up all Train sets
#Generic Industrial
msk = np.random.rand(len(df_cat_indust_1000)) < 0.95
df_cat_indust_1000_train = df_cat_indust_1000[msk]   
df_cat_indust_1000_test = df_cat_indust_1000[~msk]

msk = np.random.rand(len(df_cat_indust_200)) < 0.95
df_cat_indust_200_train = df_cat_indust_200[msk]   
df_cat_indust_200_test = df_cat_indust_200[~msk]

#Major Industrial
msk = np.random.rand(len(df_cat_indust_major_1000)) < 0.95
df_cat_indust_major_1000_train = df_cat_indust_major_1000[msk]   
df_cat_indust_major_1000_test = df_cat_indust_major_1000[~msk]

msk = np.random.rand(len(df_cat_indust_major_200)) < 0.95
df_cat_indust_major_200_train = df_cat_indust_major_200[msk]   
df_cat_indust_major_200_test = df_cat_indust_major_200[~msk]

#Minor Industrial
msk = np.random.rand(len(df_cat_indust_minor_1000)) < 0.95
df_cat_indust_minor_1000_train = df_cat_indust_minor_1000[msk]   
df_cat_indust_minor_1000_test = df_cat_indust_minor_1000[~msk]

msk = np.random.rand(len(df_cat_indust_minor_200)) < 0.95
df_cat_indust_minor_200_train = df_cat_indust_minor_200[msk]   
df_cat_indust_minor_200_test = df_cat_indust_minor_200[~msk]
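The boolean-mask split above is repeated for every label set, so it can be wrapped once. A sketch using the notebook's own `np.random.rand` approach (the helper name `split_train_test` is hypothetical):

```python
import numpy as np
import pandas as pd

def split_train_test(df, train_frac=0.8, seed=None):
    """Random boolean-mask split, mirroring the cells above."""
    rng = np.random.RandomState(seed)
    msk = rng.rand(len(df)) < train_frac
    return df[msk], df[~msk]

toy = pd.DataFrame({'x': range(100)})
train, test = split_train_test(toy, train_frac=0.8, seed=1)
```

Unlike the inline cells, passing a seed makes each split reproducible across runs.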
In [69]:
printmd('<br><br><b>`Number of New Build By Property Type in ' + str(Borough_Selection) + '`</b> &#9663;<br><br>')
df_proptype = df_cat_indust.groupby(['Planning_A', 'ClassificationCode'])['Planning_A'].count().unstack('ClassificationCode').fillna(0)
df_proptype.plot(kind='bar', stacked=True, figsize=(18, 10));

printmd('<br><br><b>`Number of Industrial Developments per CNN Classifier Label Mix by Selected Boroughs`</b> &#9663;<br><br>')
fig=plt.figure(figsize=(15,12))
plotdat_group(df_cat_indust_1000_train,'Planning_A', 221, 'Generic Industrial')
plotdat_group(df_cat_indust_major_1000_train,'Planning_A', 223, 'Major Development Type')
plotdat_group(df_cat_indust_minor_1000_train,'Planning_A', 224, 'Minor Development Type')
plt.tight_layout()
plt.show()



`Number of New Build By Property Type in ('Bexley', 'Barking and Dagenham', 'Barnet', 'Brent', 'Bromley', 'Camden', 'Croydon', 'City of London', 'Enfield', 'Ealing', 'Greenwich', 'Hackney', 'Haringey', 'Hammersmith and Fulham', 'Harrow', 'Havering', 'Hounslow', 'Hillingdon', 'Kensington and Chelsea', 'Islington', 'Kingston upon Thames', 'Lewisham', 'Lambeth', 'Redbridge', 'Newham', 'Merton', 'London Legacy DC', 'Southwark', 'Sutton', 'Richmond upon Thames', 'Waltham Forest', 'Wandsworth', 'Tower Hamlets', 'Westminster')`

Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a86fc18>



`Number of Industrial Developments per CNN Classifier Label Mix by Selected Boroughs`

In [ ]:
thumbs('df_cat_indust_1000_train', 'Generic Industrial', 100, 2)
thumbs('df_cat_indust_major_1000_train', 'Major Industrial', 100, 2)
thumbs('df_cat_indust_minor_1000_train', 'Minor Industrial', 100, 2)

Appendix 2.9 - Create Simple Building Type Label Set:

For troubleshooting overfitting and noisy image-label issues. Aim: obtain higher accuracy on the classification task by simplifying the detail in the image dataset.

In [ ]:
df_cat_commercial_office['FloorSpace'].describe()
df_cat_commercial_retail['FloorSpace'].describe()
df_cat_indust['FloorSpace'].describe()
df_cat_flats['Proposed_Total_Residential_Units'].describe()

Simple Building Types Criteria

  • House Class = Detached Only --> House Structure
  • Flats Class = > 20 Units Only --> Block of Flats
  • Office Class = > 1000 Sq Ft + No Live/Work --> Tall Offices
  • Industrial Class = > 100 Sq Ft --> Site on an Industrial Estate
  • Retail --> Shopping Centre No Terraces/High Streets
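These criteria can also be expressed as a dictionary of boolean masks, which keeps the class definitions in one place. A sketch on a toy frame (column names follow the notebook; the `criteria` table itself is illustrative, covering three of the classes):

```python
import pandas as pd

# Toy stand-in for df_ldd_Filter1; the real frame has many more columns
df = pd.DataFrame({
    'ClassificationCode': ['RD02', 'RD06', 'CO01', 'CI01', 'CR07'],
    'Proposed_Total_Residential_Units': [1, 25, 0, 0, 0],
    'Proposed_Total_Floorspace': [120, 3000, 1500, 400, 900],
})

# Hypothetical criteria table mirroring the bullets above
criteria = {
    'house':  df['ClassificationCode'].str.startswith('RD02'),
    'flats':  (df['ClassificationCode'].str.startswith('RD06')
               & (df['Proposed_Total_Residential_Units'] > 20)),
    'office': (df['ClassificationCode'].str.startswith('CO')
               & (df['Proposed_Total_Floorspace'] > 1000)),
}

label_sets = {name: df[mask] for name, mask in criteria.items()}
```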

200 + Sample Size

In [ ]:
# Quick Test for finding stronger types
#df_cat_flats['Total_Site_Area_Prop_ha'].describe()
#df_cat_flats_major_floorspace = df_cat_flats[df_cat_flats['Total_Site_Area_Prop_ha'] > 2]
#.sample(n=200, random_state=1)
#len(df_cat_flats_major_floorspace)
#df_cat_flats['Cash_in_Lieu_Affordable_Housing'].describe()
#df_cat_flats_major_floorspace = df_cat_flats[df_cat_flats['Cash_in_Lieu_Affordable_Housing'] > '200']
#len(df_cat_flats_major_floorspace)
#thumbs("df_cat_flats_major_floorspace", "Major Development", 100, 10)
#df_cat_flats_major_floorspace['Proposed_Total_Floorspace'].describe()

#Flats
df_cat_flats_major_floorspace = df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] > 20].sample(n=200, random_state=1)
len(df_cat_flats_major_floorspace)
#thumbs("df_cat_flats_major_floorspace", "Simple Building Types", 200, 2)
#df_cat_flats_major_floorspace['Proposed_Total_Floorspace'].describe()

#House
df_cat_house_major_floorspace = df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('RD02')].sample(n=200, random_state=1)
len(df_cat_house_major_floorspace)
thumbs("df_cat_house_major_floorspace", "Simple Building Types", 200, 2)

#Terraces
df_cat_terraces_major_floorspace = df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('RD04')].sample(n=200, random_state=1)
len(df_cat_terraces_major_floorspace)
thumbs("df_cat_terraces_major_floorspace", "Simple Building Types", 200, 2)

#Office
df_cat_commercial_office_major_floorspace = df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('CO')]
df_cat_commercial_office_major_floorspace = df_cat_commercial_office_major_floorspace[df_cat_commercial_office_major_floorspace['Proposed_Total_Floorspace'] > 1000].sample(n=200, random_state=1) #, replace=True)
len(df_cat_commercial_office_major_floorspace)
#df_cat_commercial_office_major_floorspace.head(10)
#thumbs("df_cat_commercial_office_major_floorspace", "Simple Building Types", 200, 2)

#CR02PO = Post Office, 
#Restaurant 
#Retail
df_cat_commercial_retail_major_floorspace = df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('CR07') |
                                                          df_ldd_Filter1['ClassificationCode'].str.startswith('CR02PO')
                                                                          ].sample(n=200, random_state=1)
#df_cat_commercial_retail_major_floorspace = df_cat_commercial_retail[df_cat_commercial_retail['Cash_in_Lieu_Affordable_Housing'] > '10'].sample(n=100, random_state=1, replace=True)
len(df_cat_commercial_retail_major_floorspace)
#thumbs("df_cat_commercial_retail_major_floorspace", "Simple Building Types", 200, 2)

#df_cat_commercial_indust_major_floorspace
df_cat_commercial_indust_major_floorspace = df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('CI04') |
                                                          df_ldd_Filter1['ClassificationCode'].str.startswith('CI01')
                                                                          ].sample(n=200, random_state=1)
#df_cat_commercial_indust_major_floorspace = df_cat_indust_1000_train[df_cat_indust_1000_train['Proposed_Total_Floorspace'] > 100].sample(n=100, random_state=1)
len(df_cat_commercial_indust_major_floorspace)
#thumbs("df_cat_commercial_indust_major_floorspace", "Simple Building Types", 200, 2)

#df_cat_house_major_floorspace
#df_cat_flats_major_floorspace
#df_cat_commercial_office_major_floorspace
#df_cat_commercial_retail_major_floorspace
#df_cat_commercial_indust_major_floorspace 
#df_cat_terraces_major_floorspace

msk = np.random.rand(len(df_cat_commercial_indust_major_floorspace)) < 0.80
df_cat_commercial_indust_major_floorspace_train = df_cat_commercial_indust_major_floorspace[msk]   
df_cat_commercial_indust_major_floorspace_test = df_cat_commercial_indust_major_floorspace[~msk]

msk = np.random.rand(len(df_cat_commercial_retail_major_floorspace)) < 0.80
df_cat_commercial_retail_major_floorspace_train = df_cat_commercial_retail_major_floorspace[msk]   
df_cat_commercial_retail_major_floorspace_test = df_cat_commercial_retail_major_floorspace[~msk]

msk = np.random.rand(len(df_cat_commercial_office_major_floorspace)) < 0.80
df_cat_commercial_office_major_floorspace_train = df_cat_commercial_office_major_floorspace[msk]   
df_cat_commercial_office_major_floorspace_test = df_cat_commercial_office_major_floorspace[~msk]

msk = np.random.rand(len(df_cat_terraces_major_floorspace)) < 0.80
df_cat_terraces_major_floorspace_train = df_cat_terraces_major_floorspace[msk]   
df_cat_terraces_major_floorspace_test = df_cat_terraces_major_floorspace[~msk]

msk = np.random.rand(len(df_cat_house_major_floorspace)) < 0.80
df_cat_house_major_floorspace_train = df_cat_house_major_floorspace[msk]   
df_cat_house_major_floorspace_test = df_cat_house_major_floorspace[~msk]

msk = np.random.rand(len(df_cat_flats_major_floorspace)) < 0.80
df_cat_flats_major_floorspace_train = df_cat_flats_major_floorspace[msk]   
df_cat_flats_major_floorspace_test = df_cat_flats_major_floorspace[~msk]

1000 + Sample Size

In [72]:
#Flats
len(df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] > 20])

#House
len(df_cat_commercial_office_major_floorspace[df_cat_commercial_office_major_floorspace['Proposed_Total_Floorspace'] > 750])

#Terrace
len(df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('RD04')])

#Retail
len(df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('CR07') |
                                                          df_ldd_Filter1['ClassificationCode'].str.startswith('CR02PO')
                                                                          ])
len(df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('CI04') |
                                                          df_ldd_Filter1['ClassificationCode'].str.startswith('CI01')
                                                                          ])
Out[72]:
793
Out[72]:
200
Out[72]:
3563
Out[72]:
318
Out[72]:
255
In [73]:
# Quick Test for finding stronger types
#df_cat_flats['Total_Site_Area_Prop_ha'].describe()
#df_cat_flats_major_floorspace = df_cat_flats[df_cat_flats['Total_Site_Area_Prop_ha'] > 2]
#.sample(n=750, random_state=1)
#len(df_cat_flats_major_floorspace)
#df_cat_flats['Cash_in_Lieu_Affordable_Housing'].describe()
#df_cat_flats_major_floorspace = df_cat_flats[df_cat_flats['Cash_in_Lieu_Affordable_Housing'] > '750']
#len(df_cat_flats_major_floorspace)
#thumbs("df_cat_flats_major_floorspace", "Major Development", 100, 10)
#df_cat_flats_major_floorspace['Proposed_Total_Floorspace'].describe()

#Flats
df_cat_flats_major_floorspace = df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] > 20].sample(n=750, random_state=1)
len(df_cat_flats_major_floorspace)
#thumbs("df_cat_flats_major_floorspace", "Simple Building Types", 750, 2)
#df_cat_flats_major_floorspace['Proposed_Total_Floorspace'].describe()

#House
df_cat_house_major_floorspace = df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('RD02')].sample(n=750, random_state=1)
len(df_cat_house_major_floorspace)
#thumbs("df_cat_house_major_floorspace", "Simple Building Types", 750, 2)

#Terraces
df_cat_terraces_major_floorspace = df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('RD04')].sample(n=750, random_state=1)
len(df_cat_terraces_major_floorspace)
#thumbs("df_cat_terraces_major_floorspace", "Simple Building Types", 750, 2)

#Office
df_cat_commercial_office_major_floorspace = df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('CO')]
df_cat_commercial_office_major_floorspace = df_cat_commercial_office_major_floorspace[df_cat_commercial_office_major_floorspace['Proposed_Total_Floorspace'] > 29].sample(n=290, random_state=1) #, replace=True)
len(df_cat_commercial_office_major_floorspace)
#df_cat_commercial_office_major_floorspace.head(10)
#thumbs("df_cat_commercial_office_major_floorspace", "Simple Building Types", 290, 2)

#CR02PO = Post Office, 
#Restaurant 
#Retail
df_cat_commercial_retail_major_floorspace = df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('CR07') |
                                                          df_ldd_Filter1['ClassificationCode'].str.startswith('CR02PO')
                                                                          ].sample(n=310, random_state=1)
#df_cat_commercial_retail_major_floorspace = df_cat_commercial_retail[df_cat_commercial_retail['Cash_in_Lieu_Affordable_Housing'] > '10'].sample(n=100, random_state=1, replace=True)
len(df_cat_commercial_retail_major_floorspace)
#thumbs("df_cat_commercial_retail_major_floorspace", "Simple Building Types", 310, 2)

#df_cat_commercial_indust_major_floorspace
df_cat_commercial_indust_major_floorspace = df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('CI04') |
                                                          df_ldd_Filter1['ClassificationCode'].str.startswith('CI01')
                                                                          ].sample(n=250, random_state=1)
#df_cat_commercial_indust_major_floorspace = df_cat_indust_750_train[df_cat_indust_750_train['Proposed_Total_Floorspace'] > 100].sample(n=100, random_state=1)
len(df_cat_commercial_indust_major_floorspace)
#thumbs("df_cat_commercial_indust_major_floorspace", "Simple Building Types", 250, 2)

#df_cat_house_major_floorspace
#df_cat_flats_major_floorspace
#df_cat_commercial_office_major_floorspace
#df_cat_commercial_retail_major_floorspace
#df_cat_commercial_indust_major_floorspace 
#df_cat_terraces_major_floorspace

msk = np.random.rand(len(df_cat_commercial_indust_major_floorspace)) < 0.80
df_cat_commercial_indust_major_floorspace_train = df_cat_commercial_indust_major_floorspace[msk]   
df_cat_commercial_indust_major_floorspace_test = df_cat_commercial_indust_major_floorspace[~msk]

msk = np.random.rand(len(df_cat_commercial_retail_major_floorspace)) < 0.80
df_cat_commercial_retail_major_floorspace_train = df_cat_commercial_retail_major_floorspace[msk]   
df_cat_commercial_retail_major_floorspace_test = df_cat_commercial_retail_major_floorspace[~msk]

msk = np.random.rand(len(df_cat_commercial_office_major_floorspace)) < 0.80
df_cat_commercial_office_major_floorspace_train = df_cat_commercial_office_major_floorspace[msk]   
df_cat_commercial_office_major_floorspace_test = df_cat_commercial_office_major_floorspace[~msk]

msk = np.random.rand(len(df_cat_terraces_major_floorspace)) < 0.80
df_cat_terraces_major_floorspace_train = df_cat_terraces_major_floorspace[msk]   
df_cat_terraces_major_floorspace_test = df_cat_terraces_major_floorspace[~msk]

msk = np.random.rand(len(df_cat_house_major_floorspace)) < 0.80
df_cat_house_major_floorspace_train = df_cat_house_major_floorspace[msk]   
df_cat_house_major_floorspace_test = df_cat_house_major_floorspace[~msk]

msk = np.random.rand(len(df_cat_flats_major_floorspace)) < 0.80
df_cat_flats_major_floorspace_train = df_cat_flats_major_floorspace[msk]   
df_cat_flats_major_floorspace_test = df_cat_flats_major_floorspace[~msk]
Out[73]:
750
Out[73]:
750
Out[73]:
750
Out[73]:
290
Out[73]:
310
Out[73]:
250

Lookup Image/Building Master Data (to troubleshoot Address Type Anomalies)

In [147]:
# Select Fields of Choice Only
#df_ldd_Filter1['TempID'][df_ldd_Filter1['TempID'] == 62770]
df_ldd_Filter1[df_ldd_Filter1['TempID'] == 20092]
Out[147]:
TempID Planning_A Resi_Site_Prop_ha Non_Resi_Site_Prop_ha Total_Open_ Space_Exist_ha Total_Open_ Space_Prop_ha Total_Site_Area_Prop_ha Existing_Total_Residential_Units Proposed_Total_Residential_Units Proposed_TotalAffordable_Units Proposed_Total_Affordable_Percentage Proposed_Residential_Parking_Spaces Cash_in_Lieu_Affordable_Housing Existing_Total_Bedrooms Proposed_Total_Bedrooms Existing_Total_Floorspace Proposed_Total_Floorspace ClassificationCode Latitude Longitude Developmen Primary_St Site_Name_ Postcode_Join Concatenated Class_Desc Primary_Code Secondary_Code Tertiary_Code Quaternary_Code Primary_Desc Secondary_Desc Tertiary_Desc Quaternary_Desc Unnamed: 0 Postcode Large User Deleted Acorn Category Acorn Group Acorn Type Description class_label
12924 20092 Enfield 0.017 0.0 0.0 0.0 0.017 0 1 0 0 2 0 0 0 0 RD02 51.656351 -0.051446 Subdivision of site and erection of an end of terrace 3-bed single family dwelling house together with a single storey rear extension to existing house. Mapleton Road 13 EN13PE RD02 Detached R D 2.0 NaN Residential Dwelling Detached NaN 771603.0 EN 1 3PE 0.0 0.0 5.0 P 53.0 Low income terraces RD

Note:

Appendix 2.10 - Create Hybrid Building Type Groups:

For troubleshooting overfitting and noisy image-label issues. Aim: obtain higher accuracy on the classification task by simplifying the detail in the image dataset.

In [ ]:
#Flats and Office as Tower Block class attempt 
# Excluding Semi-Detached as it causes issues
In [ ]:
# FOR NOW - JUST COPY AND PASTE COMBO FOLDERS

#Label Run 72

Appendix 2.11 - Create Building Type Misfits Class:

For troubleshooting overfitting and noisy image-label issues. Aim: obtain higher accuracy on the classification task by simplifying the detail in the image dataset.

See One vs Rest Label Run

Appendix 2.12 - Create One vs Rest Label Set:

For troubleshooting overfitting and noisy image-label issues. Aim: obtain higher accuracy on the classification task by simplifying its labelling domain.

In [ ]:
#More Data
# FOR NOW >> GET CLEAN FLATS OR ANOTHER ONE-CLASS FOLDER
# COPY THE REST INTO A "THE REST" FOLDER

Appendix 2.13 - Places 365 Re-Processed Re-Joined Runs

Process Overview:

Create 1000+ Image Sets >> Run Places365 Classification on Dev Machine >> Process Munge File in Excel >> Import Places365 Munge File >> Join and Filter on Inclusive Places365 Types >> Save to New Folder

In [50]:
#df_labeled_join = pd.read_csv('munge_in.csv',  sep=',', 
#                           error_bad_lines=False, index_col=False,  na_values=['.'], encoding="ISO-8859-1")
#df_labeled_join = pd.read_csv('/Users/anthonysutton/ml2/_CASA_Project_Files/_LDD_CNN_FINAL_VERSION/places_post_filter.csv',  sep=',', 
#                           error_bad_lines=False, index_col=False,  na_values=['.'], encoding="ISO-8859-1")

df_labeled_join = pd.read_csv('/Users/anthonysutton/ml2/thurs_port/post_munge_1000.csv',  sep=',', 
                           error_bad_lines=False, index_col=False,  na_values=['.'], encoding="ISO-8859-1")

#df_labeled_join = pd.read_csv('/Users/anthonysutton/ml2/_CASA_Project_Files/_LDD_CNN_FINAL_VERSION/ldd_image_munge_4_edit.csv',  sep=',', 
#                              error_bad_lines=False, index_col=False,  na_values=['.'], encoding="ISO-8859-1")
 
##df_labeled_join = pd.read_csv('/Users/anthonysutton/ml2/_CASA_Project_Files/_LDD_CNN_FINAL_VERSION/ldd_image_mungeSun_.csv',  sep=',', 
#                           error_bad_lines=False, index_col=False,  na_values=['.'], encoding="ISO-8859-1")


# Note: the column name ' Category' carries a leading space from the munged CSV header
datum = df_labeled_join.groupby(' Category').filter(lambda x: len(x) > 10)
l=datum.groupby(' Category').size()
l.sort_values()
Out[50]:
 Category
schoolhouse                   11 
ice_skating_rink/outdoor      12 
volleyball_court/outdoor      13 
atrium/public                 13 
medina                        13 
kasbah                        13 
hangar/indoor                 13 
runway                        15 
synagogue/outdoor             16 
parking_garage/indoor         16 
  dam                         16 
pharmacy                      17 
fastfood_restaurant           18 
junkyard                      20 
phone_booth                   21 
garage/indoor                 21 
patio                         22 
balcony/interior              22 
mansion                       22 
barndoor                      23 
heliport                      23 
playground                    24 
ski_resort                    25 
amphitheater                  26 
courthouse                    27 
fire_escape                   28 
lock_chamber                  32 
mezzanine                     33 
doorway/outdoor               34 
kennel/outdoor                34 
beach_house                   36 
 slum                         41 
oast_house                    50 
street                        53 
ticket_booth                  54 
construction_site             58 
promenade                     72 
library/outdoor               74 
plaza                         74 
gas_station                   76 
balcony/exterior              87 
driveway                      95 
hospital                      104
shopfront                     112
building_facade               117
fire_station                  127
hangar/outdoor                131
manufactured_home             141
garage/outdoor                147
courtyard                     161
inn/outdoor                   184
hotel/outdoor                 205
apartment_building/outdoor    229
house                         309
general_store/outdoor         320
industrial_area               414
embassy                       416
parking_lot                   458
loading_dock                  463
parking_garage/outdoor        688
residential_neighborhood      765
motel                         815
dtype: int64
In [ ]:
# Code For Exporting Thumb Gallery to HTML File
data, metadata = get_ipython().display_formatter.format(HTML(gallery))
#data = 'test'
with open('table.html', 'w') as f:
    f.write(data['text/html'])  # Assuming the object has an HTML representation
In [51]:
#df_labeled_join.head()
df_labeled_join.shape
Out[51]:
(7997, 4)
In [52]:
fig=plt.figure(figsize=(16,8))
plt.yticks(fontsize=8)
l.plot(kind='bar',fontsize=12,color='k')  
plt.xlabel('',)
plt.ylabel('Number of Images',fontsize=10)
plt.show()
Out[52]:
(array([0. , 0.2, 0.4, 0.6, 0.8, 1. ]), <a list of 6 Text yticklabel objects>)
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x1078b1550>
Out[52]:
Text(0.5,0,'')
Out[52]:
Text(0,0.5,'Number of ...')

Visual Inspection of Image Data by Places 365 Categorisation

In [53]:
df_labeled_join_365munged = df_labeled_join
In [54]:
df_labeled_join_365munged.head(10)
Out[54]:
TempID Category Image Prob
0 22860 hospital /Users/banana/Documents/ML4/1000_Labels/TRAIN1000/df_cat_indust_minor_1000_train/col_id_22860.jpg NaN
1 56844 industrial_area /Users/banana/Documents/ML4/1000_Labels/TRAIN1000/df_cat_indust_minor_1000_train/col_id_56844.jpg NaN
2 59008 parking_lot /Users/banana/Documents/ML4/1000_Labels/TRAIN1000/df_cat_indust_minor_1000_train/col_id_59008.jpg NaN
3 52613 kennel/outdoor /Users/banana/Documents/ML4/1000_Labels/TRAIN1000/df_cat_indust_minor_1000_train/col_id_52613.jpg NaN
4 52820 general_store/outdoor /Users/banana/Documents/ML4/1000_Labels/TRAIN1000/df_cat_indust_minor_1000_train/col_id_52820.jpg NaN
5 53445 inn/outdoor /Users/banana/Documents/ML4/1000_Labels/TRAIN1000/df_cat_indust_minor_1000_train/col_id_53445.jpg NaN
6 22519 loading_dock /Users/banana/Documents/ML4/1000_Labels/TRAIN1000/df_cat_indust_minor_1000_train/col_id_22519.jpg NaN
7 28236 boathouse /Users/banana/Documents/ML4/1000_Labels/TRAIN1000/df_cat_indust_minor_1000_train/col_id_28236.jpg NaN
8 47183 garage/outdoor /Users/banana/Documents/ML4/1000_Labels/TRAIN1000/df_cat_indust_minor_1000_train/col_id_47183.jpg NaN
9 34145 driveway /Users/banana/Documents/ML4/1000_Labels/TRAIN1000/df_cat_indust_minor_1000_train/col_id_34145.jpg NaN
In [ ]:
#printmd('<br><br><b>`' + title +'`</b> &#9663;<br><br>')
#the_frame = eval(d_frame)
#print(the_frame.shape)
#imagey = df_ldd_Filter1['TempID'].sample(n=1)
size = 1        # thumbnails sampled per category
replace = True  # sample with replacement
# Keep only categories with more than `size` images available
thumbsdf = df_labeled_join_365munged.groupby(' Category').filter(lambda x : len(x)>size)
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
thumb_gally = thumbsdf.groupby(' Category', as_index=False).apply(fn)
thumb_gally2 = thumb_gally[['TempID', ' Category', ' Image']].values
gallery = ""

for tid, cat, _img in thumb_gally2:

    if tid <= 29192:
        inp2 = '../../_STREETVIEW_EXPLORER/LDD_Complete/col_id_' + str(tid) + '.jpg'
    else:
        inp2 = '../../_STREETVIEW_EXPLORER/LDD_Complete/batch2/col_id_' + str(tid) + '.jpg'

    inp3 = ("<div class='gallery'><div class='zoom'><img src='" + inp2 +
            "' width='' height=''></div><div class='desc'>" +
            "Cat: " + str(cat) + " Image id: " + str(tid) + "</div></div>")
    gallery = gallery + inp3

display(HTML(gallery)) 

Note: Open the exported file in Excel, filter on the relevant categories, save, and re-import.

In [ ]:
df_labeled_join_365munged = df_labeled_join
df_labeled_join_365munged.shape
#mylist = ['manufactu', 'kennel', 'parking_lot']
mylist = ['apartment', 'church', 'house', 'industrial_area', 'museum', 'building_facade', 'embassy', 'hospital', 'parking_garage', 'hotel']
#mylist_2 = 
pattern = '|'.join(mylist)
df_labeled_join_365munged = df_labeled_join_365munged[df_labeled_join_365munged[' Category'].str.contains(pattern)]

#file_name = df_labeled_join_365munged[' Image'].str[28:]
#print(file_name)
df_labeled_join_365munged.shape
#df_labeled_join_365munged.describe()
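Note that the Places365 category strings use underscores and the column values carry a leading space, so the substrings passed to `str.contains` must match the underscored form exactly. A toy check of the alternation pattern:

```python
import pandas as pd

# Category values as they appear in the munged file (leading space, underscores)
cats = pd.Series([' house', ' industrial_area', ' building_facade', ' motel'])
keep = ['house', 'industrial_area', 'building_facade']  # no regex metacharacters
mask = cats.str.contains('|'.join(keep))
```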

Join Master Dataset to Places 365 Munged

In [55]:
df_labeled_join_365munged_join = pd.merge(df_ldd_Filter1, df_labeled_join_365munged, on='TempID' , how='inner')
df_labeled_join_365munged_join.shape
Out[55]:
(3316, 46)
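The inner join above silently drops any TempID with no surviving Places365 row. One way to audit how many records are lost (toy frames, illustrative only) is a left merge with `indicator=True`:

```python
import pandas as pd

master = pd.DataFrame({'TempID': [1, 2, 3],
                       'ClassificationCode': ['RD02', 'CO01', 'CR07']})
labels = pd.DataFrame({'TempID': [2, 3],
                       ' Category': ['house', 'shopfront']})

# '_merge' records whether each row matched; 'left_only' rows would be dropped
checked = master.merge(labels, on='TempID', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
```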
In [56]:
df_labeled_join_365munged.head(20)
Out[56]:
TempID Category Image Prob
0 65097 embassy /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_65097.jpg 0.404
1 36008 house /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_36008.jpg 0.228
2 44925 house /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_44925.jpg 0.092
3 32720 industrial_area /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_32720.jpg 0.248
4 59618 house /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_59618.jpg 0.155
5 38379 embassy /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_38379.jpg 0.485
6 24224 embassy /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_24224.jpg 0.178
7 2869 embassy /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_2869.jpg 0.150
8 23789 hospital /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_23789.jpg 0.117
9 34183 house /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_34183.jpg 0.161
10 61236 hotel/outdoor /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_61236.jpg 0.215
11 60316 building_facade /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_60316.jpg 0.170
12 64170 embassy /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_64170.jpg 0.352
13 29558 hospital /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_29558.jpg 0.231
14 26044 embassy /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_26044.jpg 0.187
15 41391 house /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_41391.jpg 0.220
16 52886 building_facade /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_52886.jpg 0.090
17 13789 house /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_13789.jpg 0.412
18 12642 parking_garage/outdoor /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_12642.jpg 0.525
19 57647 industrial_area /Users/banana/Documents/ML4/Label_Store/July18/TEST1000/df_cat_flats_1000_test/col_id_57647.jpg 0.303
In [ ]:
df_labeled_join_365munged_join.to_csv('munged_checker_1.csv', index=False)
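As a quick sanity check on join coverage before writing the CSV, pandas' `indicator=True` reports which side of the merge each row came from. A minimal sketch with hypothetical stand-in frames (the `TempID` values are invented):

```python
import pandas as pd

# Hypothetical mini-frames standing in for df_ldd_Filter1 and
# df_labeled_join_365munged; only TempID matters for the join check.
left = pd.DataFrame({'TempID': [1, 2, 3, 4], 'ClassificationCode': ['RD02'] * 4})
right = pd.DataFrame({'TempID': [2, 3, 5], 'Category': ['house', 'flat', 'embassy']})

# indicator=True adds a _merge column telling us where each row came from,
# so we can see how many labels an inner join would keep or drop.
audit = pd.merge(left, right, on='TempID', how='outer', indicator=True)
matched = (audit['_merge'] == 'both').sum()
print(matched)  # rows that would survive an inner join
```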
In [76]:
#House
#df_cat_house = df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('RD02')]


#Terraces
df_cat_terraces = df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('RD04')]
df_cat_terraces.shape

df_cat_house = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('RD02')) | 
                              (df_ldd_Filter1['ClassificationCode'].str.startswith('RD03')) |
                              (df_ldd_Filter1['ClassificationCode'].str.startswith('RD04'))]
#.sample(n=300, random_state=1)

df_cat_house.shape

df_cat_flats = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('RD06'))]
df_cat_flats = df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] > 10]
#.sample(n=300, random_state=1)
df_cat_flats.shape


df_cat_commercial_office = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('CO'))].copy()
df_cat_commercial_office['FloorSpace'] = df_cat_commercial_office['Existing_Total_Floorspace'] + df_cat_commercial_office['Proposed_Total_Floorspace']
df_cat_commercial_office = df_cat_commercial_office[df_cat_commercial_office['FloorSpace'] > 80]
#.sample(n=300, random_state=1)
df_cat_commercial_office.shape

df_cat_commercial_retail = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('CR'))].copy()
df_cat_commercial_retail['FloorSpace'] = df_cat_commercial_retail['Existing_Total_Floorspace'] + df_cat_commercial_retail['Proposed_Total_Floorspace']
df_cat_commercial_retail = df_cat_commercial_retail[df_cat_commercial_retail['FloorSpace'] > 1000]
#.sample(n=300, random_state=1)
df_cat_commercial_retail.shape


df_cat_indust = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('CI'))].copy()
df_cat_indust['FloorSpace'] = df_cat_indust['Existing_Total_Floorspace'] + df_cat_indust['Proposed_Total_Floorspace']
df_cat_indust = df_cat_indust[df_cat_indust['FloorSpace'] > 1000]
#.sample(n=220, random_state=1)
df_cat_indust.shape
Out[76]:
(3563, 43)
Out[76]:
(8090, 43)
Out[76]:
(1379, 43)
/Users/anthonysutton/ml2/env/lib/python3.6/site-packages/ipykernel_launcher.py:23: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Out[76]:
(1026, 44)
Out[76]:
(309, 44)
Out[76]:
(228, 44)
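The `SettingWithCopyWarning` comes from deriving `FloorSpace` on a filtered view of `df_ldd_Filter1`. A hedged sketch of the same filter-then-derive pattern on an explicit `.copy()`, using a toy stand-in frame (column names match the notebook; the values are invented):

```python
import pandas as pd

# Toy stand-in for df_ldd_Filter1.
df = pd.DataFrame({
    'ClassificationCode': ['CO01', 'CO02', 'CR01', 'CI04'],
    'Existing_Total_Floorspace': [100.0, 0.0, 50.0, 900.0],
    'Proposed_Total_Floorspace': [0.0, 200.0, 30.0, 500.0],
})

def category_with_floorspace(frame, prefix, min_floorspace):
    """Filter by AddressBase prefix, then derive FloorSpace on an explicit copy
    so pandas never warns about writing to a view."""
    cat = frame[frame['ClassificationCode'].str.startswith(prefix)].copy()
    cat['FloorSpace'] = (cat['Existing_Total_Floorspace']
                         + cat['Proposed_Total_Floorspace'])
    return cat[cat['FloorSpace'] > min_floorspace]

offices = category_with_floorspace(df, 'CO', 80)
print(offices.shape)  # → (2, 4)
```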
In [ ]:
#Flats
len(df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] > 20])

#House
#len(df_cat_commercial_office_major_floorspace[df_cat_commercial_office_major_floorspace['Proposed_Total_Floorspace'] > 750])

len(df_cat_house)
#Terrace
len(df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('RD04')])

#Retail
len(df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('CR07') |
                                                          df_ldd_Filter1['ClassificationCode'].str.startswith('CR02PO')
                                                                          ])
len(df_ldd_Filter1[df_ldd_Filter1['ClassificationCode'].str.startswith('CI04') |
                                                          df_ldd_Filter1['ClassificationCode'].str.startswith('CI01')
                                                                          ])
In [91]:
#6 Aug 19 - 1000, Munged but not Filtered
df_labeled_join_365munged_house_join = pd.merge(df_cat_house, df_labeled_join_365munged, on='TempID' , 
                                                how='outer').sample(n=660, random_state=1)
#df_labeled_join_365munged_house_join.shape

#Houses Cheat
#df_labeled_join_365munged_house_join = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('RD02')) | 
#                              (df_ldd_Filter1['ClassificationCode'].str.startswith('RD03')) |
#                              (df_ldd_Filter1['ClassificationCode'].str.startswith('RD04'))].sample(n=650, random_state=1)

df_labeled_join_365munged_house_join.shape

#df_labeled_join_365munged_terrace_join = pd.merge(df_cat_terraces, df_labeled_join_365munged, on='TempID' , how='inner')
#.sample(n=1000, random_state=1)
#df_labeled_join_365munged_terrace_join.shape

df_labeled_join_365munged_flat_join = pd.merge(df_cat_flats, df_labeled_join_365munged, on='TempID' , 
                                               how='inner')
#.sample(n=800, random_state=1)
df_labeled_join_365munged_flat_join.shape

df_labeled_join_365munged_office_join = pd.merge(df_cat_commercial_office, df_labeled_join_365munged, 
                                                 on='TempID' , how='inner')
#.sample(n=800, random_state=1)
df_labeled_join_365munged_office_join.shape

#df_labeled_join_365munged_retail_join = pd.merge(df_cat_commercial_retail, df_labeled_join_365munged, 
#                                                 on='TempID' , how='inner')
#.sample(n=150, random_state=1)
#df_labeled_join_365munged_retail_join.shape

#df_labeled_join_365munged_indust_join = pd.merge(df_cat_indust, df_labeled_join_365munged, on='TempID' , 
#                                                 how='inner')
#.sample(n=150, random_state=1)
#df_labeled_join_365munged_indust_join.shape
Out[91]:
(660, 46)
Out[91]:
(668, 46)
Out[91]:
(652, 47)
In [ ]:
#out = df_labeled_join_365munged_house_join.sort_values('TempID')
#out
out = df_cat_house.sort_values('TempID')
out
In [81]:
#Test and Train Split

msk = np.random.rand(len(df_labeled_join_365munged_indust_join)) < 0.95
df_labeled_join_365munged_indust_train = df_labeled_join_365munged_indust_join[msk]   
df_labeled_join_365munged_indust_test = df_labeled_join_365munged_indust_join[~msk]

msk = np.random.rand(len(df_labeled_join_365munged_retail_join)) < 0.95
df_labeled_join_365munged_retail_join_train = df_labeled_join_365munged_retail_join[msk]   
df_labeled_join_365munged_retail_join_test = df_labeled_join_365munged_retail_join[~msk]

msk = np.random.rand(len(df_labeled_join_365munged_office_join)) < 0.95
df_labeled_join_365munged_office_join_train = df_labeled_join_365munged_office_join[msk]   
df_labeled_join_365munged_office_join_test = df_labeled_join_365munged_office_join[~msk]

#msk = np.random.rand(len(df_labeled_join_365munged_terrace_join)) < 0.85
#df_labeled_join_365munged_terrace_join_train = df_labeled_join_365munged_terrace_join[msk]   
#df_labeled_join_365munged_terrace_join_test = df_labeled_join_365munged_terrace_join[~msk]

msk = np.random.rand(len(df_labeled_join_365munged_house_join)) < 0.95
df_labeled_join_365munged_house_join_train = df_labeled_join_365munged_house_join[msk]   
df_labeled_join_365munged_house_join_test = df_labeled_join_365munged_house_join[~msk]

msk = np.random.rand(len(df_labeled_join_365munged_flat_join)) < 0.95
df_labeled_join_365munged_flat_join_train = df_labeled_join_365munged_flat_join[msk]   
df_labeled_join_365munged_flat_join_test = df_labeled_join_365munged_flat_join[~msk]
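The five mask-based splits above repeat the same three lines per label set; a small helper makes the split reusable. Seeding the generator is an assumption on my part — the notebook's masks are unseeded, so its splits differ between runs:

```python
import numpy as np
import pandas as pd

def split_train_test(frame, train_frac=0.95, seed=1):
    """Random row split into (train, test); the seed makes label runs repeatable."""
    rng = np.random.RandomState(seed)
    msk = rng.rand(len(frame)) < train_frac
    return frame[msk], frame[~msk]

# Hypothetical stand-in for one of the *_join frames.
demo = pd.DataFrame({'TempID': range(100)})
train, test = split_train_test(demo)
print(len(train) + len(test))  # → 100, every row lands in exactly one split
```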
In [ ]:
#df_labeled_join_365munged_house_join
#df_labeled_join_365munged_flat_join
#df_labeled_join_365munged_office_join
#df_labeled_join_365munged_retail_join
#df_labeled_join_365munged_indust_join


#printmd('<br><br><b>`' + title +'`</b> &#9663;<br><br>')
#the_frame = eval(d_frame)
#print(the_frame.shape)
#imagey = df_ldd_Filter1['TempID'].sample(n=1)
size = 2        # sample size per category
replace = True  # sample with replacement
# Keep only categories with more than `size` labelled images
thumbsdf = df_labeled_join_365munged_office_join.groupby(' Category').filter(lambda x : len(x)>size)
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
thumb_gally = thumbsdf.groupby(' Category', as_index=False).apply(fn)
thumb_gally2 = thumb_gally[['TempID', ' Category', ' Image']].values
gallery = ""

for count, element in enumerate(thumb_gally2, 1):   
   
    if thumb_gally2[count-1,0] <= 29192:
        inp2 = '../../_STREETVIEW_EXPLORER/LDD_Complete/col_id_' + str(thumb_gally2[count-1,0]) +'.jpg'
    else:
        inp2 = '../../_STREETVIEW_EXPLORER/LDD_Complete/batch2/col_id_' + str(thumb_gally2[count-1,0]) +'.jpg'

    inp3 = "<div class='gallery'><div class='zoom'><img src='" +  inp2 + "' width='' height=''></div><div class='desc'>" + "Cat" + \
             ": " + str(thumb_gally2[count-1,1]) + " Image id: " + str(thumb_gally2[count-1,0]) +  "</div></div>"
    gallery = gallery + inp3

display(HTML(gallery)) 
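The `TempID <= 29192` batch test recurs in several cells; a small helper centralises the path mapping (threshold and relative root are taken from the gallery cell above):

```python
def image_path_for(temp_id, batch_threshold=29192,
                   root='../../_STREETVIEW_EXPLORER/LDD_Complete'):
    """Map a TempID to its Street View image path.

    IDs up to the threshold live in the first download batch (custom SV API
    controls); later IDs live in batch2 (default API settings).
    """
    subdir = '' if temp_id <= batch_threshold else 'batch2/'
    return f'{root}/{subdir}col_id_{temp_id}.jpg'

print(image_path_for(100))
print(image_path_for(40000))
```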

Appendix 2.14 - Sample Size N = 2000

Explanation of the adopted 2000+ sample size approach: Mega Mix - get as many labels as you can!

HOUSE

In [39]:
df_cat_house = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('RD02')) | 
                              (df_ldd_Filter1['ClassificationCode'].str.startswith('RD03'))
                              | (df_ldd_Filter1['ClassificationCode'].str.startswith('RD04'))  ]

df_cat_house.shape
df_cat_house.describe()
Out[39]:
(8090, 43)
Out[39]:
TempID Resi_Site_Prop_ha Non_Resi_Site_Prop_ha Total_Open_ Space_Exist_ha Total_Open_ Space_Prop_ha Total_Site_Area_Prop_ha Existing_Total_Residential_Units Proposed_Total_Residential_Units Proposed_TotalAffordable_Units Proposed_Total_Affordable_Percentage Proposed_Residential_Parking_Spaces Existing_Total_Bedrooms Proposed_Total_Bedrooms Existing_Total_Floorspace Proposed_Total_Floorspace Latitude Longitude Tertiary_Code Unnamed: 0 Large User Deleted Acorn Category Acorn Type
count 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 8090.000000 5.446000e+03 5446.000000 5446.000000 5371.000000 5371.000000
mean 32516.166749 0.062631 0.004379 0.005513 0.001171 0.068181 0.871199 3.417676 0.861187 5.335476 3.031891 0.224104 0.172064 78.168480 25.364030 51.496133 -0.117451 3.116316 1.363628e+06 0.012119 0.013772 2.517036 23.232545
std 19303.975401 0.216708 0.087213 0.126687 0.060534 0.260717 6.556296 16.761604 8.159592 21.970297 15.430509 2.633943 2.320753 825.568495 968.783741 0.080279 0.175467 0.866653 7.220317e+05 0.109427 0.116552 1.345466 15.879226
min 4.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 51.293497 -0.498269 2.000000 2.374630e+05 0.000000 0.000000 1.000000 1.000000
25% 14920.750000 0.014000 0.000000 0.000000 0.000000 0.015000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 51.432644 -0.237186 2.000000 7.164422e+05 0.000000 0.000000 1.000000 11.000000
50% 32965.500000 0.026000 0.000000 0.000000 0.000000 0.027000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 51.499794 -0.143002 3.000000 1.403481e+06 0.000000 0.000000 2.000000 20.000000
75% 49363.500000 0.056000 0.000000 0.000000 0.000000 0.058000 1.000000 2.000000 0.000000 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 51.564126 0.020858 4.000000 2.039091e+06 0.000000 0.000000 3.000000 29.000000
max 65871.000000 7.744000 5.250000 7.012000 5.300000 8.997000 420.000000 587.000000 468.000000 100.000000 662.000000 96.000000 83.000000 39778.000000 85000.000000 51.677610 0.276986 4.000000 2.386637e+06 1.000000 1.000000 6.000000 62.000000
In [42]:
#Get Samples
#House Sample Size > 1000
#The Other AB Classes will require code for Setting Max and Mins
df_cat_house_2000Plus = df_cat_house.sample(n=3000, random_state=1)
#df_cat_house = df_cat_house.sample(n=200, random_state=1)
df_cat_house_2000Plus.shape


# 100s Image Sample Label Set
msk = np.random.rand(len(df_cat_house_2000Plus)) < 0.9
df_cat_house_2000p_train = df_cat_house_2000Plus[msk]   
df_cat_house_2000p_test = df_cat_house_2000Plus[~msk]
df_cat_house_2000p_train.shape
df_cat_house_2000p_test.shape
Out[42]:
(3000, 43)
Out[42]:
(2691, 43)
Out[42]:
(309, 43)

FLAT

In [40]:
#FLATS
#Get AB Code RD06
df_cat_flats = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('RD06'))]
df_cat_flats.shape
#df_cat_flats.describe()
Out[40]:
(16814, 43)
In [47]:
#Use LDD Criteria to Create Major Development Flat Type Class
df_cat_flats_major_2000p = df_cat_flats[df_cat_flats['Proposed_Total_Residential_Units'] > 5].sample(n=3000, random_state=1)
df_cat_flats_major_2000p.shape
Out[47]:
(3000, 43)
In [48]:
# Split into Test and Train Sets:
# As mentioned earlier, the TensorFlow CNN function will split data into Train, Validation & Test sets,
# so we need only a small amount for our final Test Set

#Generic Type
msk = np.random.rand(len(df_cat_flats_major_2000p)) < 0.9
df_cat_flats_2000p_train = df_cat_flats_major_2000p[msk]   
df_cat_flats_2000p_test = df_cat_flats_major_2000p[~msk]

df_cat_flats_2000p_train.shape
df_cat_flats_2000p_test.shape

COMMERCIAL

In [41]:
# RETAIL
#Get AB Code CR
df_cat_commercial_retail = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('CR'))]
df_cat_commercial_retail.shape
df_cat_commercial_retail.describe()
Out[41]:
(2789, 43)
Out[41]:
TempID Resi_Site_Prop_ha Non_Resi_Site_Prop_ha Total_Open_ Space_Exist_ha Total_Open_ Space_Prop_ha Total_Site_Area_Prop_ha Existing_Total_Residential_Units Proposed_Total_Residential_Units Proposed_TotalAffordable_Units Proposed_Total_Affordable_Percentage Proposed_Residential_Parking_Spaces Existing_Total_Bedrooms Proposed_Total_Bedrooms Existing_Total_Floorspace Proposed_Total_Floorspace Latitude Longitude Tertiary_Code Unnamed: 0 Large User Deleted Acorn Category Acorn Type
count 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2789.000000 2509.000000 1.988000e+03 1988.000000 1988.000000 1918.000000 1918.000000
mean 34844.882754 0.027782 0.035648 0.001007 0.000824 0.064253 0.522768 6.490140 1.372535 2.975977 2.189674 0.641449 2.394765 526.823234 615.406956 51.504571 -0.118155 7.487047 1.375373e+06 0.037726 0.035211 2.924400 27.225235
std 19495.014154 0.113927 0.202654 0.022313 0.019074 0.243187 2.206015 29.727076 11.001792 15.478677 12.837393 18.057332 23.489690 3297.855759 4559.168684 0.069200 0.127908 1.585524 6.814646e+05 0.190582 0.184360 1.326673 14.845662
min 47.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 51.309772 -0.482140 1.000000 2.373730e+05 0.000000 0.000000 1.000000 2.000000
25% 17921.000000 0.004000 0.000000 0.000000 0.000000 0.006000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 51.460425 -0.198795 8.000000 7.164508e+05 0.000000 0.000000 2.000000 16.000000
50% 36146.000000 0.009000 0.000000 0.000000 0.000000 0.012000 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 40.000000 0.000000 51.513693 -0.110480 8.000000 1.403419e+06 0.000000 0.000000 2.000000 20.000000
75% 52506.000000 0.017000 0.002000 0.000000 0.000000 0.028000 1.000000 4.000000 0.000000 0.000000 0.000000 0.000000 0.000000 170.000000 0.000000 51.553263 -0.046071 8.000000 2.039394e+06 0.000000 0.000000 4.000000 36.000000
max 66214.000000 3.300000 4.870000 0.747000 0.710000 4.870000 67.000000 734.000000 297.000000 100.000000 325.000000 909.000000 602.000000 82290.000000 117739.000000 51.665857 0.258354 11.000000 2.387253e+06 1.000000 1.000000 6.000000 62.000000
In [124]:
#Create Class Membership Control 
df_cat_indust['FloorSpace'] = df_cat_indust['Existing_Total_Floorspace'] + df_cat_indust['Proposed_Total_Floorspace']


#df_cat_indust_minor.shape

#df_cat_indust_major =df_cat_indust[df_cat_indust['FloorSpace'] > 1000]
#df_cat_indust_major.shape

#df_cat_indust['FloorSpace'].describe() 


max_ind = max_major_ind = max_minor_ind = 1000

# Borough Selection may yield fewer than 1000 Records, so get as many as possible
if len(df_cat_indust[df_cat_indust['FloorSpace'] > 1000]) < 1000:
    max_major_ind = len(df_cat_indust[df_cat_indust['FloorSpace'] > 1000])
if len(df_cat_indust[df_cat_indust['FloorSpace'] < 1000]) < 1000:
    max_minor_ind = len(df_cat_indust[df_cat_indust['FloorSpace'] < 1000])
if len(df_cat_indust) < 1000:
    max_ind = len(df_cat_indust)

df_cat_indust_1000 = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('CI'))].sample(n=max_ind, random_state=1)
df_cat_indust_200 = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('CI'))].sample(n=200, random_state=1)

df_cat_indust_minor_1000 = df_cat_indust[df_cat_indust['FloorSpace'] < 1000].sample(n=max_minor_ind, random_state=1)
df_cat_indust_minor_200 = df_cat_indust[df_cat_indust['FloorSpace'] < 1000].sample(n=min(200, max_minor_ind), random_state=1)

df_cat_indust_major_1000 = df_cat_indust[df_cat_indust['FloorSpace'] > 1000].sample(n=max_major_ind, random_state=1)
df_cat_indust_major_200 = df_cat_indust[df_cat_indust['FloorSpace'] > 1000].sample(n=min(200, max_major_ind), random_state=1)
/Users/anthonysutton/ml2/env/lib/python3.6/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
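The capping logic above can be collapsed into one helper: sample `n` rows, or everything when the borough selection is smaller. A sketch (the frame here is a hypothetical stand-in):

```python
import pandas as pd

def sample_up_to(frame, n, seed=1):
    """Sample n rows, or all rows when the borough selection is smaller than n."""
    return frame.sample(n=min(n, len(frame)), random_state=seed)

# Hypothetical stand-in for a small borough-level category frame.
demo = pd.DataFrame({'TempID': range(7)})
print(len(sample_up_to(demo, 1000)))  # → 7, capped at the frame size
print(len(sample_up_to(demo, 5)))     # → 5
```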
In [ ]:
#Test Train Split
#Low Count so bump up all Train sets
#Generic Industrial
msk = np.random.rand(len(df_cat_indust_1000)) < 0.95
df_cat_indust_1000_train = df_cat_indust_1000[msk]   
df_cat_indust_1000_test = df_cat_indust_1000[~msk]

msk = np.random.rand(len(df_cat_indust_200)) < 0.95
df_cat_indust_200_train = df_cat_indust_200[msk]   
df_cat_indust_200_test = df_cat_indust_200[~msk]

#Major Industrial
msk = np.random.rand(len(df_cat_indust_major_1000)) < 0.95
df_cat_indust_major_1000_train = df_cat_indust_major_1000[msk]   
df_cat_indust_major_1000_test = df_cat_indust_major_1000[~msk]

msk = np.random.rand(len(df_cat_indust_major_200)) < 0.95
df_cat_indust_major_200_train = df_cat_indust_major_200[msk]   
df_cat_indust_major_200_test = df_cat_indust_major_200[~msk]

#Minor Industrial
msk = np.random.rand(len(df_cat_indust_minor_1000)) < 0.95
df_cat_indust_minor_1000_train = df_cat_indust_minor_1000[msk]   
df_cat_indust_minor_1000_test = df_cat_indust_minor_1000[~msk]

msk = np.random.rand(len(df_cat_indust_minor_200)) < 0.95
df_cat_indust_minor_200_train = df_cat_indust_minor_200[msk]   
df_cat_indust_minor_200_test = df_cat_indust_minor_200[~msk]

Stage 3 - Saving Label Sets as Training Label Image Buckets

3.1 - Inspect Previous Label Runs

In [ ]:
# SEE WHAT FOLDERS HAVE BEEN CREATED
import os
rootdir = '/Users/anthonysutton/ml2/_FINAL_SCRIPT_LIBRARY/LABELS/'

for subdir, dirs, files in os.walk(rootdir):
    print(subdir)

3.2 - Create Label Set Folder Name e.g. LABEL_RUN_1:

In [92]:
#LABEL_RUN_64
RunName = input()
LABEL_RUN_85

3.3 - Create Label Names e.g. ACORN TYPE:

In [ ]:
# For now we will use the DataFrame Name
CodeName = input()

3.4 - Select Label Members e.g. RD06 RD04:

In [ ]:
#Copy And Paste, Labels of Interest:
df_cat_house_200_train df_cat_house_200_test  df_cat_house_1000_train df_cat_house_1000_test df_cat_flats_200_train df_cat_flats_200_test  df_cat_flats_1000_train df_cat_flats_1000_test
df_cat_flats_major_1000_train df_cat_flats_major_1000_test df_cat_flats_major_200_train df_cat_flats_major_200_test df_cat_flats_minor_1000_train df_cat_flats_minor_1000_test
df_cat_flats_minor_200_train df_cat_flats_minor_200_test df_cat_flats_fuzzy_units_200_train df_cat_flats_fuzzy_units_200_test df_cat_flats_fuzzy_units_1000_train df_cat_flats_fuzzy_units_1000_test df_cat_commercial_office_200_train df_cat_commercial_office_200_test df_cat_commercial_office_1000_train df_cat_commercial_office_1000_test df_cat_commercial_office_minor_1000_train
df_cat_commercial_office_minor_1000_test df_cat_commercial_office_minor_200_train
df_cat_commercial_office_minor_200_train df_cat_commercial_office_major_1000_test df_cat_commercial_office_major_1000_train
df_cat_commercial_office_major_200_test df_cat_commercial_office_major_200_train df_cat_indust_1000_train df_cat_indust_1000_test df_cat_indust_200_train df_cat_indust_200_test
df_cat_indust_major_1000_train df_cat_indust_major_1000_test df_cat_indust_major_200_train df_cat_indust_major_200_test df_cat_indust_minor_1000_train df_cat_indust_minor_1000_test
df_cat_indust_major_200_train df_cat_indust_major_200_test df_cat_commercial_retail_1000_major_train df_cat_commercial_retail_1000_major_test df_cat_commercial_retail_200_major_test df_cat_commercial_retail_200_major_train
df_cat_commercial_retail_1000_minor_train df_cat_commercial_retail_1000_minor_test df_cat_commercial_retail_200_minor_train df_cat_commercial_retail_200_minor_test df_cat_commercial_retail_200_train df_cat_commercial_retail_200_test df_cat_commercial_retail_1000_train df_cat_commercial_retail_1000_test
In [ ]:
#Copy And Paste, Labels of Interest:
df_cat_house_200_train df_cat_flats_major_200_train df_cat_commercial_office_major_200_train df_cat_indust_major_200_train df_cat_commercial_retail_200_major_train 
In [ ]:
#Copy And Paste, Labels of Interest:
#Simple Types Run
df_cat_house_major_floorspace_test df_cat_flats_major_floorspace_test df_cat_commercial_office_major_floorspace_test df_cat_commercial_retail_major_floorspace_test df_cat_commercial_indust_major_floorspace_test df_cat_terraces_major_floorspace_test
df_cat_house_major_floorspace_train df_cat_flats_major_floorspace_train df_cat_commercial_office_major_floorspace_train df_cat_commercial_retail_major_floorspace_train df_cat_commercial_indust_major_floorspace_train df_cat_terraces_major_floorspace_train
In [ ]:
#Copy And Paste, Labels of Interest:
#Places 365 Munged Run
df_labeled_join_365munged_house_join df_labeled_join_365munged_flat_join df_labeled_join_365munged_office_join
df_labeled_join_365munged_retail_join df_labeled_join_365munged_indust_join
In [ ]:
#Copy And Paste, Labels of Interest:
# Places 365 Munged Run 2 & 3
df_labeled_join_365munged_indust_train  df_labeled_join_365munged_indust_test df_labeled_join_365munged_retail_join_train df_labeled_join_365munged_retail_join_test df_labeled_join_365munged_office_join_train df_labeled_join_365munged_office_join_test
df_labeled_join_365munged_house_join_train df_labeled_join_365munged_house_join_test df_labeled_join_365munged_flat_join_train df_labeled_join_365munged_flat_join_test 
# Places 365 Munged Run 3
df_labeled_join_365munged_terrace_join_train df_labeled_join_365munged_terrace_join_test 
In [ ]:
df_cat_flats_2000p_train = df_cat_flats_major_2000p[msk]   
df_cat_flats_2000p_test = df_cat_flats_major_2000p[~msk]

df_cat_flats_2000p_train.shape
df_cat_flats_2000p_test.shape
In [66]:
ValueName = input()
df_labeled_join_365munged_office_join_train df_labeled_join_365munged_office_join_test df_labeled_join_365munged_house_join_train df_labeled_join_365munged_house_join_test df_labeled_join_365munged_flat_join_train df_labeled_join_365munged_flat_join_test
In [84]:
input_list = ValueName.split() #splits the input string on whitespace
# elements stay as strings - they are DataFrame names, not integers
In [85]:
for a in input_list:
    print(a)
df_labeled_join_365munged_office_join_train
df_labeled_join_365munged_office_join_test
df_labeled_join_365munged_house_join_train
df_labeled_join_365munged_house_join_test
df_labeled_join_365munged_flat_join_train
df_labeled_join_365munged_flat_join_test
Or get all labels from the dataframe:

input_list = labels['index']
input_list

Final Check of Input Parameters

print(RunName)
print(input_list)

3.5 - Copy Image Files into Label Folder Buckets

Warning: Ensure your Parameters are 100% correct before proceeding with Image File Copy.
In [93]:
# Check Sample Sizes
# Use Eval to pass in DF Name as String
[print(str(eval(x).shape) + ' - ' + str(x)) for x in input_list]
(612, 47) - df_labeled_join_365munged_office_join_train
(40, 47) - df_labeled_join_365munged_office_join_test
(621, 43) - df_labeled_join_365munged_house_join_train
(29, 43) - df_labeled_join_365munged_house_join_test
(629, 46) - df_labeled_join_365munged_flat_join_train
(39, 46) - df_labeled_join_365munged_flat_join_test
Out[93]:
[None, None, None, None, None, None]
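`eval` on pasted names works, but a plain dict keyed by label-set name gives the same shape report without `eval`. A sketch with hypothetical mini-frames standing in for the `*_train`/`*_test` splits:

```python
import pandas as pd

# Hypothetical label frames; in the notebook these would be the
# *_train / *_test splits selected for the run.
label_sets = {
    'house_train': pd.DataFrame({'TempID': [1, 2, 3]}),
    'house_test': pd.DataFrame({'TempID': [4]}),
}

# Same check as the eval() cell above, but names travel with their frames.
for name, frame in label_sets.items():
    print(f'{frame.shape} - {name}')
```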
In [ ]:
import re
import shutil
import os
#source = os.listdir('/Users/anthonysutton/ml2/_FINAL_SCRIPT_LIBRARY/DataOut/')
#destination = '/Users/anthonysutton/ml2/_FINAL_SCRIPT_LIBRARY/LABEL_RUN_1/HOUSE'
#destination = '/Users/anthonysutton/ml2/_FINAL_SCRIPT_LIBRARY/LABEL_RUN_2/HMO'

for  a in input_list:
    
    #TO DO - rename a as ValueName
    print(a)
    
    # Version 1.0 Method
    #Get the data, convert type when necessary
    #df_label_array = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith(a))]
    #df_label_array = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'] == a)]
    #df_label_array = df_ldd_Filter1[(df_ldd_Filter1[CodeName] == float(a))]
    #df_label_array = df_ldd_Filter1[(df_ldd_Filter1[CodeName] == str(a))]
    #df_label_array = df_cat_commercial_office_hybrid
    #destination1 = '/Users/anthonysutton/ml2/_FINAL_SCRIPT_LIBRARY/LABELS/' + str(RunName) + '/' + str(a) + '/SV_RUN1'
    #destination2 = '/Users/anthonysutton/ml2/_FINAL_SCRIPT_LIBRARY/LABELS/' + str(RunName) + '/' + str(a) + '/SV_RUN2'

    # Version 1.1 Method
    if re.search(r'train', str(a)):
    #Save to Train Folder    
        destination1 = '/Users/anthonysutton/ml2/_FINAL_SCRIPT_LIBRARY/LABELS/' + str(RunName) + '/TRAIN/' + str(a) 
        destination2 = '/Users/anthonysutton/ml2/_FINAL_SCRIPT_LIBRARY/LABELS/' + str(RunName) + '/TRAIN/' + str(a) 
    
    else:
    #Save to Test Folder  
        destination1 = '/Users/anthonysutton/ml2/_FINAL_SCRIPT_LIBRARY/LABELS/' + str(RunName) + '/TEST/' + str(a) 
        destination2 = '/Users/anthonysutton/ml2/_FINAL_SCRIPT_LIBRARY/LABELS/' + str(RunName) + '/TEST/' + str(a) 

    print(destination1)
    print(destination2)
    
    if not os.path.exists(destination1):
        print('create' + destination1)
        os.makedirs(destination1)
        #os.makedirs(destination2)
        #print('create' + destination2)
         
    #for index, row in df_label_array.iterrows():
    for index, row in eval(a).iterrows():
        if row['TempID'] <= 29192:
            #Bin for the image batch with sv api controls, such as radius, angle and extent
                #destination1 = '/Users/anthonysutton/ml2/_FINAL_SCRIPT_LIBRARY/LABELS/' + RunName + '/' + str(a) + '/SV_RUN1'
                #destination1 = '/Users/anthonysutton/ml2/_FINAL_SCRIPT_LIBRARY/LABELS/' + RunName + '/' + str(a) 
                if not os.path.isfile(destination1 + '/' +  str(row['TempID'])  + '.jpg'):
                    print('COPY >>> /Users/anthonysutton/ml2/_STREETVIEW_EXPLORER/LDD_Complete/col_id_' 
                                + str(row['TempID'])  + ".jpg TO " + destination1 ) 
                    shutil.copy('/Users/anthonysutton/ml2/_STREETVIEW_EXPLORER/LDD_Complete/col_id_' 
                                + str(row['TempID'])  + '.jpg',destination1)
        elif row['TempID'] > 29192:
            #Bin for the image batch with sv default api settings(50m radius)
                #destination2 = '/Users/anthonysutton/ml2/_FINAL_SCRIPT_LIBRARY/LABELS/' + RunName + '/' + str(a) + '/SV_RUN2'
                #destination2 = '/Users/anthonysutton/ml2/_FINAL_SCRIPT_LIBRARY/LABELS/' + RunName + '/' + str(a) 
                if not os.path.isfile(destination2 + '/' + str(row['TempID'])  + '.jpg'):
                    print('COPY >>> /Users/anthonysutton/ml2/_STREETVIEW_EXPLORER/LDD_Complete/batch2/col_id_' 
                            + str(row['TempID'])  + '.jpg TO ' + destination2 )
                    shutil.copy('/Users/anthonysutton/ml2/_STREETVIEW_EXPLORER/LDD_Complete/batch2/col_id_' 
                                   + str(row['TempID'])  + '.jpg',destination2)
                    
    print("files copied")
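The copy loop above can be factored into an idempotent helper: create the bucket folder on demand and skip files already copied, so an interrupted run can be resumed safely. A sketch demonstrated against a throwaway temp directory (the paths here are invented):

```python
import os
import shutil
import tempfile

def copy_image(src, dest_dir):
    """Copy one image into a label bucket, creating the folder on demand
    and skipping files already present so re-runs are idempotent."""
    os.makedirs(dest_dir, exist_ok=True)
    target = os.path.join(dest_dir, os.path.basename(src))
    if not os.path.isfile(target):
        shutil.copy(src, target)
    return target

# Demonstrate against a throwaway temp directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'col_id_1.jpg')
    open(src, 'wb').close()  # empty stand-in for a Street View image
    out = copy_image(src, os.path.join(tmp, 'TRAIN', 'house'))
    print(os.path.isfile(out))  # → True
```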

3.6 Review Image Buckets

Method 1:

Method 2:


Stage 4 - Creating Further Building Type Label Sets

Stage 2, explored in detail above, was then repeated for the remaining CNN Classification Label Sets (itemised above). Please see Script 5 for a full table of results for all Label Train and Classification Runs explored.

The aim was to see to what extent the Inception V3 CNN Image Classification architecture might be used as a tool for exploring London's urban visual grammar and, in particular, contemporary and existing Building Typologies.

With this aim in mind, we set out to augment and deepen our data model with a wider range of descriptive classification approaches that may or may not have a visual image component.

Auxiliary Label Sets:

  • Set B: Addressbase Variations
  • Set C: CACI Acorn
  • Set D: PTAL
  • Set E: Town Planning Use Class(LBTH Only)
  • Set F: Place Pulse
  • Set G: Census Data
  • Set H: Old vs New London. Negative and Positive Sets (+/-)

WIP: For ease of use and re-usability we present the above steps in a Jupyter-widget-friendly interface, which also allows the user to explore and create further label set combinations from the data model.

4.1 - Addressbase Fuzzy Variations

In [ ]:
##Code Selector
addbase_name = df_ldd_Filter1.ClassificationCode.unique()
addbase_Selector =    widgets.SelectMultiple(
    options= addbase_name,
    #value='2',
    description='',
    disabled=False,
    rows=10,
    layout=Layout(width='50%', height='100%')
    )   

printmd("<div class='alert alert-block alert-info'><b>Select AddressBase Class:</b>" +
        " (ctrl+ for Multiple Select) &#9662;")
display(addbase_Selector)
printmd("</div>")
addbase_Selection = addbase_Selector.value

widgets.FloatRangeSlider(
    value=[5, 7.5],
    min=0,
    max=10.0,
    step=0.1,
    description='Range:',
    disabled=False,
    continuous_update=False,
    orientation='vertical',
    readout=True,
    readout_format='.1f',
)   
In [ ]:
df_cat_commercial_office = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('CO'))]

#df_cat_commercial_office['Total_Site_Area_Prop_ha'].describe()

df_cat_commercial_office['Total_Site_Area_Prop_ha'].describe(percentiles=[.10, .30, .50, .70])

df_cat_commercial_office_hybrid =  df_cat_commercial_office[df_cat_commercial_office['Total_Site_Area_Prop_ha']> .04]
df_cat_commercial_office_hybrid.head(100)

# RETAIL
#Step 1 - Get AddressBase Code CR
df_cat_commercial_retail_hybrid = df_ldd_Filter1[(df_ldd_Filter1['ClassificationCode'].str.startswith('CR'))]

df_cat_commercial_retail_hybrid.shape
df_cat_commercial_retail_hybrid.head(100)

df_cat_commercial_retail_hybrid['Total_Site_Area_Prop_ha'].describe(percentiles=[.10, .30, .50, .70])

df_cat_commercial_retail_hybrid.shape
checkout = df_cat_commercial_retail_hybrid[df_cat_commercial_retail_hybrid['Total_Site_Area_Prop_ha'] < .008]
checkout.shape
checkout.head(10)

                                                             
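The `.04` and `.008` area thresholds above are picked by eye from `describe()`; the same fuzzy split can be derived from quantiles so it is reproducible across sample runs. The helper and demo values below are illustrative assumptions, not part of the notebook's data.

```python
import pandas as pd

def fuzzy_split(df, col, lower_q=0.10, upper_q=0.70):
    """Split rows into small / core / large membership bands using
    quantile cut points instead of hand-picked thresholds.
    (Hypothetical helper sketch.)"""
    lo, hi = df[col].quantile([lower_q, upper_q])
    small = df[df[col] < lo]
    core = df[(df[col] >= lo) & (df[col] <= hi)]
    large = df[df[col] > hi]
    return small, core, large

# Hypothetical stand-in values for Total_Site_Area_Prop_ha
demo = pd.DataFrame({"Total_Site_Area_Prop_ha": [0.001, 0.01, 0.04, 0.05, 0.2, 1.0]})
small, core, large = fuzzy_split(demo, "Total_Site_Area_Prop_ha")
```

Quantile cuts adapt automatically when the sample size changes, which matters for the iterative varied-sample-size runs mentioned in the appendices.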

4.2 - CACI Acorn Geo Demographic Classes

In [ ]:
#caci = df_ldd_Filter1['Acorn Category'].unique()
caci_labels = df_ldd_Filter1['Description'].value_counts().reset_index(name="count").query("count > 100")
caci_labels

4.3 - PTAL Zone

In [ ]:
#df_ldd_Filter1['PTAL'].unique()
labels = df_ldd_Filter1['PTAL'].value_counts().reset_index(name="count").query("count > 100")
labels

4.4 - Place Pulse

The Place Pulse MIT Project quantitatively measures urban perception by crowdsourcing visual surveys to users around the globe.

http://pulse.media.mit.edu/vision/

4.5 - UK Census Wards

Does a local area have CNN-discernible visual characteristics?

See the paper "What Makes Paris Look Like Paris?" (Doersch et al.): https://dl.acm.org/citation.cfm?id=2830541

In [ ]:
#df_ldd_Filter1['Ward'].unique()
labels = df_ldd_Filter1['Ward'].value_counts().reset_index(name="count").query("count > 100")
labels

4.6 - UK Town Planning Use Classes

4.7 - Old and New London

In [ ]:
# Get image names from the non-LDD Acorn Street View download run
final_acorn_label = pd.read_csv('/Users/anthonysutton/ml2/_ACORN_EXPLORER/final_acorn_label_upload9_noblank.csv', sep=',',
                                error_bad_lines=False, index_col=False, na_values=['.'], encoding="ISO-8859-1")
# df_non_ldd is loaded separately (see the commented-out cell below)
df_non_ldd.rename(columns={'Postcode': 'Postcode_Join'}, inplace=True)
In [ ]:
#Join
df_ldd_acorn_label_join = pd.merge(df_non_ldd, final_acorn_label, on='Postcode_Join' , how='left')
In [ ]:
#df_non_ldd = pd.read_csv('/Users/anthonysutton/ml2/_DATA_LABELS/LDD_STORE/LDD_MODEL/nonLDD.csv',  sep=',', 
#                           error_bad_lines=False, index_col=False,  
#                         usecols = ['Field1', 'Postcode', 'SubBuildingName', 'BuildingName',   'ClassificationCode'], 
#                         na_values=['.'], encoding="ISO-8859-1")

Summary

  • Fuzzy Set Limitations
  • Patchy Set Data Limitations
  • Object Recognition vs Scene Classification (e.g. DELF) vs Semantic Segmentation
  • Biblio: Set Theory, Fuzzy Logic
In [ ]:
Fuzzy_Selector =  widgets.FloatRangeSlider(
    value=[5, 7.5],
    min=0,
    max=10.0,
    step=0.1,
    description='Range:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='.1f',
)

printmd("<div class='alert alert-block alert-info'><b>Fuzzy Range Controller:</b>  &#9662;")


display(Fuzzy_Selector)

In [6]:
Fuzzy_Selection = Fuzzy_Selector.value
printmd("<div class='alert alert-block alert-info'>Selected Fuzzy Control Range = <b> " + str(Fuzzy_Selection) + "</b></div>")
Selected Fuzzy Control Range = (5.0, 7.5)
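The tuple read from the widget can then drive set membership directly. A minimal sketch, assuming a hypothetical `score` column standing in for whichever data-model attribute the fuzzy controller governs:

```python
import pandas as pd

Fuzzy_Selection = (5.0, 7.5)  # value as read from the widget above

# Hypothetical attribute column standing in for a data-model field
df = pd.DataFrame({"score": [2.0, 5.5, 6.1, 7.4, 9.9]})

lo, hi = Fuzzy_Selection
in_range = df[df["score"].between(lo, hi)]  # inclusive on both ends
```

Rows inside the selected range form the label bucket; re-running the cell after moving the slider regenerates the set.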